Chapter 4 - Numeric data wrangling#

2022 August 25

For text preprocessing, see Chapter 7 “English text preprocessing basics”

wrangle

The pandas library#

Import the pandas library with the alias pd. This is just a shortcut to call its methods!

Use “dot notation” to apply its methods to the gapminder dataset, which is stored in a tabular .csv file.

To import the .csv file, use the .read_csv() pandas method. The only argument for now is the file path to a .csv file.

Learn more about the Gapminder data

import pandas as pd

# import the gapminder dataset
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")

print(type(gap))

<class 'pandas.core.frame.DataFrame'>

Pandas methods#

Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. How might the wrangling process look?

First, look at the data
Compute summary statistics
Subset rows
- first row
- first three rows
- rows 10 thru 14
Subset columns:
- one column
- multiple columns
Row and column subset
Subset by logical condition(s)

First, look at the data#

print or call the dataset to view its output. Use pandas methods to learn about the data!

print(gap)

          country  year         pop continent  lifeExp   gdpPercap
   Afghanistan  1952   8425333.0      Asia   28.801  779.445314
   Afghanistan  1957   9240934.0      Asia   30.332  820.853030
   Afghanistan  1962  10267083.0      Asia   31.997  853.100710
   Afghanistan  1967  11537966.0      Asia   34.020  836.197138
   Afghanistan  1972  13079460.0      Asia   36.088  739.981106
...           ...   ...         ...       ...      ...         ...
   Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
   Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
   Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
   Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
   Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

[1704 rows x 6 columns]

gap

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106
...	...	...	...	...	...	...
1699	Zimbabwe	1987	9216418.0	Africa	62.351	706.157306
1700	Zimbabwe	1992	10704340.0	Africa	60.377	693.420786
1701	Zimbabwe	1997	11404948.0	Africa	46.809	792.449960
1702	Zimbabwe	2002	11926563.0	Africa	39.989	672.038623
1703	Zimbabwe	2007	12311143.0	Africa	43.487	469.709298

1704 rows × 6 columns

`.head()`#

# .head() shows the first five rows by default
gap.head()

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

`.columns`#

# .columns shows the column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

`.shape`#

# .shape shows the number of rows by columns
gap.shape

(1704, 6)

Summary statistics#

Produce summary statistics, including:

Count, mean, sd, quartiles, min/max
Tabulate frequencies

`.describe()`#

# produce summary statistics for numeric data
gap.describe()

	year	pop	lifeExp	gdpPercap
count	1704.00000	1.704000e+03	1704.000000	1704.000000
mean	1979.50000	2.960121e+07	59.474439	7215.327081
std	17.26533	1.061579e+08	12.917107	9857.454543
min	1952.00000	6.001100e+04	23.599000	241.165876
25%	1965.75000	2.793664e+06	48.198000	1202.060309
50%	1979.50000	7.023596e+06	60.712500	3531.846988
75%	1993.25000	1.958522e+07	70.845500	9325.462346
max	2007.00000	1.318683e+09	82.603000	113523.132900

`.mean()` and `.std()`#

# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print("Life expectancy mean:", lifeExp_mean)
print("Life expectancy sd:", lifeExp_sd)

Life expectancy mean: 59.474439366197174
Life expectancy sd: 12.917107415241192

`.groupby()` and `.count()`#

These two methods are useful for tabulating frequencies by a grouping variable!

# count the number of observations grouped by each continent
gap.groupby("continent").count()["country"]

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
Name: country, dtype: int64

Subset rows or columns#

Sampling data is necessary for many reasons, including quick sanity checks.

Slice a data frame by using bracket notation to specify start and end points [start : end]

The [start index is included and the end] index is excluded.

Remember that Python is a zero-indexed language, so starts counting from zero, not one.

Leave the start or end values blank to start from the beginning, or go to the end of a collection.

Row subset: slice just the first row#

gap[:1]

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314

Row subset: slice first three rows#

gap[:3]

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710

Row subset: slice rows 10 thru 14#

subset1 = gap[10:15]
subset1

	country	year	pop	continent	lifeExp	gdpPercap
10	Afghanistan	2002	25268405.0	Asia	42.129	726.734055
11	Afghanistan	2007	31889923.0	Asia	43.828	974.580338
12	Albania	1952	1282697.0	Europe	55.230	1601.056136
13	Albania	1957	1476505.0	Europe	59.280	1942.284244
14	Albania	1962	1728137.0	Europe	64.820	2312.888958

Column subset: one column#

# type the column name as a string in square brackets
gap['lifeExp']

     28.801
     30.332
     31.997
     34.020
     36.088
         ...  
  62.351
  60.377
  46.809
  39.989
  43.487
Name: lifeExp, Length: 1704, dtype: float64

Column subset: multiple columns#

# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2

	continent	lifeExp	gdpPercap
0	Asia	28.801	779.445314
1	Asia	30.332	820.853030
2	Asia	31.997	853.100710
3	Asia	34.020	836.197138
4	Asia	36.088	739.981106
...	...	...	...
1699	Africa	62.351	706.157306
1700	Africa	60.377	693.420786
1701	Africa	46.809	792.449960
1702	Africa	39.989	672.038623
1703	Africa	43.487	469.709298

1704 rows × 3 columns

Row and column subset#

# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3

	continent	lifeExp	gdpPercap
855	Asia	64.624	80894.88326
856	Asia	67.712	109347.86700
857	Asia	69.343	59265.47714
858	Asia	71.309	31354.03573

# A column in a pandas data frame Pandas "Series" can be thought of like numpy arrays
# But, beware, they do not function exactly the same!
type(gap["lifeExp"])

pandas.core.series.Series

Subset by logical condition(s)#

# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 80]
le2

	country	year	pop	continent	lifeExp	gdpPercap
70	Australia	2002	19546792.0	Oceania	80.370	30687.75473
71	Australia	2007	20434176.0	Oceania	81.235	34435.36744
251	Canada	2007	33390141.0	Americas	80.653	36319.23501
539	France	2007	61083916.0	Europe	80.657	30470.01670
670	Hong Kong China	2002	6762476.0	Asia	81.495	30209.01516
671	Hong Kong China	2007	6980412.0	Asia	82.208	39724.97867
694	Iceland	2002	288030.0	Europe	80.500	31163.20196
695	Iceland	2007	301931.0	Europe	81.757	36180.78919
767	Israel	2007	6426679.0	Asia	80.745	25523.27710
778	Italy	2002	57926999.0	Europe	80.240	27968.09817
779	Italy	2007	58147733.0	Europe	80.546	28569.71970
801	Japan	1997	125956499.0	Asia	80.690	28816.58499
802	Japan	2002	127065841.0	Asia	82.000	28604.59190
803	Japan	2007	127467972.0	Asia	82.603	31656.06806
1103	New Zealand	2007	4115771.0	Oceania	80.204	25185.00911
1151	Norway	2007	4627926.0	Europe	80.196	49357.19017
1427	Spain	2007	40448191.0	Europe	80.941	28821.06370
1474	Sweden	2002	8954175.0	Europe	80.040	29341.63093
1475	Sweden	2007	9031088.0	Europe	80.884	33859.74835
1486	Switzerland	2002	7361757.0	Europe	80.620	34480.95771
1487	Switzerland	2007	7554661.0	Europe	81.701	37506.41907

logical AND (`&`)#

All conditions must be satisfied to be included in the subset

# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002

	country	year	pop	continent	lifeExp	gdpPercap
695	Iceland	2007	301931.0	Europe	81.757	36180.78919

logical OR (`|`)#

Just one of multiple conditions must be satisfied to be included in the subset

# create a subset that includes country equals Ireland OR life expectancy greater than 82. 
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82

	country	year	pop	continent	lifeExp	gdpPercap
671	Hong Kong China	2007	6980412.0	Asia	82.208	39724.978670
744	Ireland	1952	2952156.0	Europe	66.910	5210.280328
745	Ireland	1957	2878220.0	Europe	68.900	5599.077872
746	Ireland	1962	2830000.0	Europe	70.290	6631.597314
747	Ireland	1967	2900100.0	Europe	71.080	7655.568963
748	Ireland	1972	3024400.0	Europe	71.280	9530.772896
749	Ireland	1977	3271900.0	Europe	72.030	11150.981130
750	Ireland	1982	3480000.0	Europe	73.100	12618.321410
751	Ireland	1987	3539900.0	Europe	74.360	13872.866520
752	Ireland	1992	3557761.0	Europe	75.467	17558.815550
753	Ireland	1997	3667233.0	Europe	76.122	24521.947130
754	Ireland	2002	3879155.0	Europe	77.783	34077.049390
755	Ireland	2007	4109086.0	Europe	78.885	40675.996350
803	Japan	2007	127467972.0	Asia	82.603	31656.068060

Exercises#

Load the file “gapminder-FiveYearData.csv” and save it in a variable named gap
Print the column names
Compute the mean for one numeric column
Compute the mean for all numeric columns
Tabulate frequencies for the “continent” column
Compute mean lifeExp and dgpPercap by continent
Create a subset of gap that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.

Data visualization#

After importing, exploring, and subsetting your data, visualization is a common technique to perform next. Read Chapter 5 “Data visualization essentials” to get started.

Text Analysis and Machine Learning (TAML) Group

Chapter 4 - Numeric data wrangling

Contents

Chapter 4 - Numeric data wrangling#

The pandas library#

Pandas methods#

First, look at the data#

`.head()`#

`.columns`#

`.shape`#

Summary statistics#

`.describe()`#

`.mean()` and `.std()`#

`.groupby()` and `.count()`#

Subset rows or columns#

Row subset: slice just the first row#

Row subset: slice first three rows#

Row subset: slice rows 10 thru 14#

Column subset: one column#

Column subset: multiple columns#

Row and column subset#

Subset by logical condition(s)#

logical AND (`&`)#

logical OR (`|`)#

Exercises#

Data visualization#

Text Analysis and Machine Learning (TAML) Group

Chapter 4 - Numeric data wrangling

Contents

Chapter 4 - Numeric data wrangling#

The pandas library#

Pandas methods#

First, look at the data#

.head()#

.columns#

.shape#

Summary statistics#

.describe()#

.mean() and .std()#

.groupby() and .count()#

Subset rows or columns#

Row subset: slice just the first row#

Row subset: slice first three rows#

Row subset: slice rows 10 thru 14#

Column subset: one column#

Column subset: multiple columns#

Row and column subset#

Subset by logical condition(s)#

logical AND (&)#

logical OR (|)#

Exercises#

Data visualization#

`.head()`#

`.columns`#

`.shape`#

`.describe()`#

`.mean()` and `.std()`#

`.groupby()` and `.count()`#

logical AND (`&`)#

logical OR (`|`)#