iv. Numeric data wrangling¶
For text preprocessing, see the January 19 notebook.
The pandas library¶
Import the pandas library with the alias pd.
Use “dot notation” to apply its methods to the dataset, which is stored in a tabular .csv file.
To import the .csv file, use the .read_csv()
pandas method. The only argument is the file path to a .csv file.
Import the data below and investigate it by applying pandas methods to the data frame gap
.
Learn more about the Gapminder data
import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>
Pandas methods¶
Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. See the examplea code below.
First, look at the data
Summary statistics
Subset rows:
first row
first three rows
rows 10 thru 14
Subset columns:
one column
multiple columns
Row and column subset
Subset by logical condition(s)
First, look at the data¶
Use pandas methods to learn about the data!
.head()
¶
# look at first five rows by default
gap.head()
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
.columns
¶
# View column names
# this is an attribute - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
.shape
¶
# Show number of rows by columns
# this is an attribute - note the lack of parentheses ()
gap.shape
(1704, 6)
Summary statistics¶
Produce summary statistics, including:
Count, mean, sd, quartiles, min/max
Tabulate frequencies
.describe()
¶
# produce summary statistics for numeric data
gap.describe()
year | pop | lifeExp | gdpPercap | |
---|---|---|---|---|
count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
.mean()
and .std()
¶
# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print(lifeExp_mean)
print(lifeExp_sd)
59.47443936619714
12.917107415241187
.groupby()
and .count()
¶
# count the number of countries grouped by each continent
gap.groupby("continent").count()["country"]
continent
Africa 624
Americas 300
Asia 396
Europe 360
Oceania 24
Name: country, dtype: int64
Subset rows or columns¶
Sampling data is necessary for many reasons, including quick sanity checks.
Slice a data frame by using bracket notation to specify start and end points [start : end]
The [start
index is included and the end]
index is excluded.
Remember that Python is a zero-indexed language, or starts counting from zero, not one.
Leave the start or end values blank to start from the beginning, or go to the end of a collection.
Row subset: slice just the first row¶
gap[:1]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
Row subset: slice first three rows¶
gap[:3]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
Row subset: slice rows 10 thru 14¶
subset1 = gap[10:15]
subset1
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
10 | Afghanistan | 2002 | 25268405.0 | Asia | 42.129 | 726.734055 |
11 | Afghanistan | 2007 | 31889923.0 | Asia | 43.828 | 974.580338 |
12 | Albania | 1952 | 1282697.0 | Europe | 55.230 | 1601.056136 |
13 | Albania | 1957 | 1476505.0 | Europe | 59.280 | 1942.284244 |
14 | Albania | 1962 | 1728137.0 | Europe | 64.820 | 2312.888958 |
Column subset: one column¶
# type the column name as a string in square brackets
gap['lifeExp']
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: lifeExp, Length: 1704, dtype: float64
Column subset: multiple columns¶
# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent | lifeExp | gdpPercap | |
---|---|---|---|
0 | Asia | 28.801 | 779.445314 |
1 | Asia | 30.332 | 820.853030 |
2 | Asia | 31.997 | 853.100710 |
3 | Asia | 34.020 | 836.197138 |
4 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... |
1699 | Africa | 62.351 | 706.157306 |
1700 | Africa | 60.377 | 693.420786 |
1701 | Africa | 46.809 | 792.449960 |
1702 | Africa | 39.989 | 672.038623 |
1703 | Africa | 43.487 | 469.709298 |
1704 rows × 3 columns
Row and column subset¶
# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent | lifeExp | gdpPercap | |
---|---|---|---|
855 | Asia | 64.624 | 80894.88326 |
856 | Asia | 67.712 | 109347.86700 |
857 | Asia | 69.343 | 59265.47714 |
858 | Asia | 71.309 | 31354.03573 |
type(gap["lifeExp"])
pandas.core.series.Series
Subset by logical condition(s)¶
# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 81]
le2
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
71 | Australia | 2007 | 20434176.0 | Oceania | 81.235 | 34435.36744 |
670 | Hong Kong China | 2002 | 6762476.0 | Asia | 81.495 | 30209.01516 |
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.97867 |
695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
802 | Japan | 2002 | 127065841.0 | Asia | 82.000 | 28604.59190 |
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.06806 |
1487 | Switzerland | 2007 | 7554661.0 | Europe | 81.701 | 37506.41907 |
# logical AND (all conditions must be satisfied to be included)
# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
# logical OR (one of multiple conditions must be satisfied to be included)
# create subset that includes country equals Ireland OR life expectancy greater than 82.
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.978670 |
744 | Ireland | 1952 | 2952156.0 | Europe | 66.910 | 5210.280328 |
745 | Ireland | 1957 | 2878220.0 | Europe | 68.900 | 5599.077872 |
746 | Ireland | 1962 | 2830000.0 | Europe | 70.290 | 6631.597314 |
747 | Ireland | 1967 | 2900100.0 | Europe | 71.080 | 7655.568963 |
748 | Ireland | 1972 | 3024400.0 | Europe | 71.280 | 9530.772896 |
749 | Ireland | 1977 | 3271900.0 | Europe | 72.030 | 11150.981130 |
750 | Ireland | 1982 | 3480000.0 | Europe | 73.100 | 12618.321410 |
751 | Ireland | 1987 | 3539900.0 | Europe | 74.360 | 13872.866520 |
752 | Ireland | 1992 | 3557761.0 | Europe | 75.467 | 17558.815550 |
753 | Ireland | 1997 | 3667233.0 | Europe | 76.122 | 24521.947130 |
754 | Ireland | 2002 | 3879155.0 | Europe | 77.783 | 34077.049390 |
755 | Ireland | 2007 | 4109086.0 | Europe | 78.885 | 40675.996350 |
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.068060 |