iv. Numeric data wrangling
Contents
iv. Numeric data wrangling#
For text preprocessing, see the January 19 notebook.
The pandas library#
Import the pandas library with the alias pd.
Use “dot notation” to apply its methods to the dataset, which is stored in a tabular .csv file.
To import the .csv file, use the .read_csv()
pandas method. The only argument is the file path to a .csv file.
Import the data below and investigate it by applying pandas methods to the data frame gap
.
Learn more about the Gapminder data
import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
print(type(gap))
<class 'pandas.core.frame.DataFrame'>
Pandas methods#
Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. See the examplea code below.
First, look at the data
Summary statistics
Subset rows:
first row
first three rows
rows 10 thru 14
Subset columns:
one column
multiple columns
Row and column subset
Subset by logical condition(s)
First, look at the data#
Use pandas methods to learn about the data!
.head()
#
# look at first five rows by default
gap.head()
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
.columns
#
# View column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
.shape
#
# Show number of rows by columns
# this is an attribute instead of a method - note the lack of parentheses ()
gap.shape
(1704, 6)
Summary statistics#
Produce summary statistics, including:
Count, mean, sd, quartiles, min/max
Tabulate frequencies
.describe()
#
# produce summary statistics for numeric data
gap.describe()
year | pop | lifeExp | gdpPercap | |
---|---|---|---|---|
count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
.mean()
and .std()
#
# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print(lifeExp_mean)
print(lifeExp_sd)
59.474439366197174
12.917107415241192
.groupby()
and .count()
#
# count the number of countries grouped by each continent
gap.groupby("continent").count()["country"]
continent
Africa 624
Americas 300
Asia 396
Europe 360
Oceania 24
Name: country, dtype: int64
Subset rows or columns#
Sampling data is necessary for many reasons, including quick sanity checks.
Slice a data frame by using bracket notation to specify start and end points [start : end]
The [start
index is included and the end]
index is excluded.
Remember that Python is a zero-indexed language, so starts counting from zero, not one.
Leave the start or end values blank to start from the beginning, or go to the end of a collection.
Row subset: slice just the first row#
gap[:1]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
Row subset: slice first three rows#
gap[:3]
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
Row subset: slice rows 10 thru 14#
subset1 = gap[10:15]
subset1
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
10 | Afghanistan | 2002 | 25268405.0 | Asia | 42.129 | 726.734055 |
11 | Afghanistan | 2007 | 31889923.0 | Asia | 43.828 | 974.580338 |
12 | Albania | 1952 | 1282697.0 | Europe | 55.230 | 1601.056136 |
13 | Albania | 1957 | 1476505.0 | Europe | 59.280 | 1942.284244 |
14 | Albania | 1962 | 1728137.0 | Europe | 64.820 | 2312.888958 |
Column subset: one column#
# type the column name as a string in square brackets
gap['lifeExp']
0 28.801
1 30.332
2 31.997
3 34.020
4 36.088
...
1699 62.351
1700 60.377
1701 46.809
1702 39.989
1703 43.487
Name: lifeExp, Length: 1704, dtype: float64
Column subset: multiple columns#
# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2
continent | lifeExp | gdpPercap | |
---|---|---|---|
0 | Asia | 28.801 | 779.445314 |
1 | Asia | 30.332 | 820.853030 |
2 | Asia | 31.997 | 853.100710 |
3 | Asia | 34.020 | 836.197138 |
4 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... |
1699 | Africa | 62.351 | 706.157306 |
1700 | Africa | 60.377 | 693.420786 |
1701 | Africa | 46.809 | 792.449960 |
1702 | Africa | 39.989 | 672.038623 |
1703 | Africa | 43.487 | 469.709298 |
1704 rows × 3 columns
Row and column subset#
# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3
continent | lifeExp | gdpPercap | |
---|---|---|---|
855 | Asia | 64.624 | 80894.88326 |
856 | Asia | 67.712 | 109347.86700 |
857 | Asia | 69.343 | 59265.47714 |
858 | Asia | 71.309 | 31354.03573 |
type(gap["lifeExp"])
pandas.core.series.Series
Subset by logical condition(s)#
# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 81]
le2
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
71 | Australia | 2007 | 20434176.0 | Oceania | 81.235 | 34435.36744 |
670 | Hong Kong China | 2002 | 6762476.0 | Asia | 81.495 | 30209.01516 |
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.97867 |
695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
802 | Japan | 2002 | 127065841.0 | Asia | 82.000 | 28604.59190 |
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.06806 |
1487 | Switzerland | 2007 | 7554661.0 | Europe | 81.701 | 37506.41907 |
# logical AND (all conditions must be satisfied to be included)
# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
695 | Iceland | 2007 | 301931.0 | Europe | 81.757 | 36180.78919 |
# logical OR (one of multiple conditions must be satisfied to be included)
# create subset that includes country equals Ireland OR life expectancy greater than 82.
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
671 | Hong Kong China | 2007 | 6980412.0 | Asia | 82.208 | 39724.978670 |
744 | Ireland | 1952 | 2952156.0 | Europe | 66.910 | 5210.280328 |
745 | Ireland | 1957 | 2878220.0 | Europe | 68.900 | 5599.077872 |
746 | Ireland | 1962 | 2830000.0 | Europe | 70.290 | 6631.597314 |
747 | Ireland | 1967 | 2900100.0 | Europe | 71.080 | 7655.568963 |
748 | Ireland | 1972 | 3024400.0 | Europe | 71.280 | 9530.772896 |
749 | Ireland | 1977 | 3271900.0 | Europe | 72.030 | 11150.981130 |
750 | Ireland | 1982 | 3480000.0 | Europe | 73.100 | 12618.321410 |
751 | Ireland | 1987 | 3539900.0 | Europe | 74.360 | 13872.866520 |
752 | Ireland | 1992 | 3557761.0 | Europe | 75.467 | 17558.815550 |
753 | Ireland | 1997 | 3667233.0 | Europe | 76.122 | 24521.947130 |
754 | Ireland | 2002 | 3879155.0 | Europe | 77.783 | 34077.049390 |
755 | Ireland | 2007 | 4109086.0 | Europe | 78.885 | 40675.996350 |
803 | Japan | 2007 | 127467972.0 | Asia | 82.603 | 31656.068060 |