Chapter 4 - Numeric data wrangling#

2023 April 7

For text preprocessing, see Chapter 7 “English text preprocessing basics”

wrangle

The pandas library#

Import the pandas library with the alias pd. This can be a useful shortcut to call its methods!

Use “dot notation” to apply its methods to the gapminder dataset, which is stored in a tabular .csv file.

To import the .csv file, use the .read_csv() pandas method. The only argument for now is the file path to a .csv file.

Learn more about the Gapminder data

import pandas as pd

# import the gapminder dataset
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")

print(type(gap))

<class 'pandas.core.frame.DataFrame'>

Pandas methods#

Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. How might the wrangling process look?

First, look at the data
Compute summary statistics
Subset rows
- first row
- first three rows
- rows 10 thru 14
Subset columns:
- one column
- multiple columns
Row and column subset
Subset by logical condition(s)

First, look at the data#

print or call the dataset to view its output. Use pandas methods to learn about the data!

print(gap)

          country  year         pop continent  lifeExp   gdpPercap
   Afghanistan  1952   8425333.0      Asia   28.801  779.445314
   Afghanistan  1957   9240934.0      Asia   30.332  820.853030
   Afghanistan  1962  10267083.0      Asia   31.997  853.100710
   Afghanistan  1967  11537966.0      Asia   34.020  836.197138
   Afghanistan  1972  13079460.0      Asia   36.088  739.981106
...           ...   ...         ...       ...      ...         ...
   Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
   Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
   Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
   Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
   Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

[1704 rows x 6 columns]

gap

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106
...	...	...	...	...	...	...
1699	Zimbabwe	1987	9216418.0	Africa	62.351	706.157306
1700	Zimbabwe	1992	10704340.0	Africa	60.377	693.420786
1701	Zimbabwe	1997	11404948.0	Africa	46.809	792.449960
1702	Zimbabwe	2002	11926563.0	Africa	39.989	672.038623
1703	Zimbabwe	2007	12311143.0	Africa	43.487	469.709298

1704 rows × 6 columns

`.head()`#

# .head() shows the first five rows by default
gap.head()

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

`.columns`#

# .columns shows the column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

`.shape`#

# .shape shows the number of rows by columns
gap.shape

(1704, 6)

`.info()`#

# .info() shows the compact structure of the dataset, including:
# class, dimensions, column names, number of missing values, and types
gap.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB

Summary statistics#

Produce summary statistics, including:

Count, mean, sd, quartiles, min/max
Tabulate frequencies

`.describe()`#

# produce summary statistics for numeric data
gap.describe()

	year	pop	lifeExp	gdpPercap
count	1704.00000	1.704000e+03	1704.000000	1704.000000
mean	1979.50000	2.960121e+07	59.474439	7215.327081
std	17.26533	1.061579e+08	12.917107	9857.454543
min	1952.00000	6.001100e+04	23.599000	241.165876
25%	1965.75000	2.793664e+06	48.198000	1202.060309
50%	1979.50000	7.023596e+06	60.712500	3531.846988
75%	1993.25000	1.958522e+07	70.845500	9325.462346
max	2007.00000	1.318683e+09	82.603000	113523.132900

`.mean()` and `.std()`#

# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print("Life expectancy mean:", lifeExp_mean)
print("Life expectancy sd:", lifeExp_sd)

Life expectancy mean: 59.47443936619714
Life expectancy sd: 12.917107415241187

`.groupby()` and `.count()`#

These two methods are useful for tabulating frequencies by a grouping variable!

import pandas as pd
gap = pd.read_csv("data/gapminder-FiveYearData.csv")

pd.set_option('display.float_format', lambda x: '%.6f' % x)

# count the number of observations grouped by each continent
gap.groupby("continent").mean()["pop"]

continent
Africa      9916003.142628
Americas   24504794.996667
Asia       77038721.974747
Europe     17169764.733333
Oceania     8874672.333333
Name: pop, dtype: float64

Subset rows or columns#

Sampling data is necessary for many reasons, including quick sanity checks.

Slice a data frame by using bracket notation to specify start and end points [start : end]

The [start index is included and the end] index is excluded.

Remember that Python is a zero-indexed language, so starts counting from zero, not one.

Leave the start or end values blank to start from the beginning, or go to the end of a collection.

Row subset: slice just the first row#

gap[:1]

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.000000	Asia	28.801000	779.445314

Row subset: slice first three rows#

gap[:3]

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.000000	Asia	28.801000	779.445314
1	Afghanistan	1957	9240934.000000	Asia	30.332000	820.853030
2	Afghanistan	1962	10267083.000000	Asia	31.997000	853.100710

Row subset: slice rows 10 thru 14#

subset1 = gap[10:15]
subset1

	country	year	pop	continent	lifeExp	gdpPercap
10	Afghanistan	2002	25268405.000000	Asia	42.129000	726.734055
11	Afghanistan	2007	31889923.000000	Asia	43.828000	974.580338
12	Albania	1952	1282697.000000	Europe	55.230000	1601.056136
13	Albania	1957	1476505.000000	Europe	59.280000	1942.284244
14	Albania	1962	1728137.000000	Europe	64.820000	2312.888958

Column subset: one column#

# type the column name as a string in square brackets
gap['lifeExp']

    28.801000
    30.332000
    31.997000
    34.020000
    36.088000
          ...   
 62.351000
 60.377000
 46.809000
 39.989000
 43.487000
Name: lifeExp, Length: 1704, dtype: float64

Column subset: multiple columns#

# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2

	continent	lifeExp	gdpPercap
0	Asia	28.801000	779.445314
1	Asia	30.332000	820.853030
2	Asia	31.997000	853.100710
3	Asia	34.020000	836.197138
4	Asia	36.088000	739.981106
...	...	...	...
1699	Africa	62.351000	706.157306
1700	Africa	60.377000	693.420786
1701	Africa	46.809000	792.449960
1702	Africa	39.989000	672.038623
1703	Africa	43.487000	469.709298

1704 rows × 3 columns

Row and column subset#

# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3

	continent	lifeExp	gdpPercap
855	Asia	64.624000	80894.883260
856	Asia	67.712000	109347.867000
857	Asia	69.343000	59265.477140
858	Asia	71.309000	31354.035730

# A column in a pandas data frame Pandas "Series" can be thought of like numpy arrays
# But, beware, they do not function exactly the same!
type(gap["lifeExp"])

pandas.core.series.Series

Subset by logical condition(s)#

# lifeExp is greater than 81
le2 = gap[gap['lifeExp'] > 81]
le2

	country	year	pop	continent	lifeExp	gdpPercap
71	Australia	2007	20434176.000000	Oceania	81.235000	34435.367440
670	Hong Kong China	2002	6762476.000000	Asia	81.495000	30209.015160
671	Hong Kong China	2007	6980412.000000	Asia	82.208000	39724.978670
695	Iceland	2007	301931.000000	Europe	81.757000	36180.789190
802	Japan	2002	127065841.000000	Asia	82.000000	28604.591900
803	Japan	2007	127467972.000000	Asia	82.603000	31656.068060
1487	Switzerland	2007	7554661.000000	Europe	81.701000	37506.419070

logical AND (`&`)#

All conditions must be satisfied to be included in the subset

# create subset that includes life expectancy greater than 81 AND pop < 500,000.
year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002

	country	year	pop	continent	lifeExp	gdpPercap
695	Iceland	2007	301931.000000	Europe	81.757000	36180.789190

logical OR (`|`)#

Just one of multiple conditions must be satisfied to be included in the subset

# create a subset that includes country equals Ireland OR life expectancy greater than 82. 
ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82

	country	year	pop	continent	lifeExp	gdpPercap
671	Hong Kong China	2007	6980412.000000	Asia	82.208000	39724.978670
744	Ireland	1952	2952156.000000	Europe	66.910000	5210.280328
745	Ireland	1957	2878220.000000	Europe	68.900000	5599.077872
746	Ireland	1962	2830000.000000	Europe	70.290000	6631.597314
747	Ireland	1967	2900100.000000	Europe	71.080000	7655.568963
748	Ireland	1972	3024400.000000	Europe	71.280000	9530.772896
749	Ireland	1977	3271900.000000	Europe	72.030000	11150.981130
750	Ireland	1982	3480000.000000	Europe	73.100000	12618.321410
751	Ireland	1987	3539900.000000	Europe	74.360000	13872.866520
752	Ireland	1992	3557761.000000	Europe	75.467000	17558.815550
753	Ireland	1997	3667233.000000	Europe	76.122000	24521.947130
754	Ireland	2002	3879155.000000	Europe	77.783000	34077.049390
755	Ireland	2007	4109086.000000	Europe	78.885000	40675.996350
803	Japan	2007	127467972.000000	Asia	82.603000	31656.068060

Exercises#

Load the file “gapminder-FiveYearData.csv” and save it in a variable named gap
Print the column names
Compute the mean for one numeric column
Compute the mean for all numeric columns
Tabulate frequencies for the “continent” column
Compute mean lifeExp and dgpPercap by continent
Create a subset of gap that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.

# 1.
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
gap

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.000000	Asia	28.801000	779.445314
1	Afghanistan	1957	9240934.000000	Asia	30.332000	820.853030
2	Afghanistan	1962	10267083.000000	Asia	31.997000	853.100710
3	Afghanistan	1967	11537966.000000	Asia	34.020000	836.197138
4	Afghanistan	1972	13079460.000000	Asia	36.088000	739.981106
...	...	...	...	...	...	...
1699	Zimbabwe	1987	9216418.000000	Africa	62.351000	706.157306
1700	Zimbabwe	1992	10704340.000000	Africa	60.377000	693.420786
1701	Zimbabwe	1997	11404948.000000	Africa	46.809000	792.449960
1702	Zimbabwe	2002	11926563.000000	Africa	39.989000	672.038623
1703	Zimbabwe	2007	12311143.000000	Africa	43.487000	469.709298

1704 rows × 6 columns

# 2. 
gap.columnsns

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[26], line 2
      1 # 2. 
----> 2 gap.columnsns

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'columnsns'

# 3. compute mean for one numeric column
gap["pop"].mean()

29601212.32511736

gap.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   year       1704 non-null   int64  
 2   pop        1704 non-null   float64
 3   continent  1704 non-null   object 
 4   lifeExp    1704 non-null   float64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB

# 4. compute mean for ALL numeric columns
# gap.describe()
gap.mean()

/var/folders/9g/fhnd1v790cj5ccxlv4rcvsy40000gq/T/ipykernel_60936/4035141191.py:3: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.
  gap.mean()

year            1979.500000
pop         29601212.325117
lifeExp           59.474439
gdpPercap       7215.327081
dtype: float64

sum(gap.groupby("continent").count()["country"] / 12)

142.0

gap.shape

(1704, 6)

142 * 12

# 6.
gap.groupby("continent")[["lifeExp", "gdpPercap"]].mean()

	lifeExp	gdpPercap
continent
Africa	48.865330	2193.754578
Americas	64.658737	7136.110356
Asia	60.064903	7902.150428
Europe	71.903686	14469.475533
Oceania	74.326208	18621.609223

# Create a subset of gap that contains only countries with lifeExp greater 
# than 75 and gdpPercap less than 5000.
healthy_poor = gap[(gap["lifeExp"] > 75) & (gap["gdpPercap"] < 5000)]
healthy_poor

	country	year	pop	continent	lifeExp	gdpPercap
22	Albania	2002	3508512.000000	Europe	75.651000	4604.211737

gap.iloc[1501]

country              Taiwan
year                   1957
pop         10164215.000000
continent              Asia
lifeExp           62.400000
gdpPercap       1507.861290
Name: 1501, dtype: object

gap.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

gap[gap.columns[[2, 3]]]

	pop	continent
0	8425333.000000	Asia
1	9240934.000000	Asia
2	10267083.000000	Asia
3	11537966.000000	Asia
4	13079460.000000	Asia
...	...	...
1699	9216418.000000	Africa
1700	10704340.000000	Africa
1701	11404948.000000	Africa
1702	11926563.000000	Africa
1703	12311143.000000	Africa

1704 rows × 2 columns

gap.ix[2]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[45], line 1
----> 1 gap.ix[2]

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'ix'

Data visualization#

After importing, exploring, and subsetting your data, visualization is a common technique to perform next. Read Chapter 5 “Data visualization essentials” to get started.

Text Analysis and Machine Learning (TAML) Group

Chapter 4 - Numeric data wrangling

Contents

Chapter 4 - Numeric data wrangling#

The pandas library#

Pandas methods#

First, look at the data#

`.head()`#

`.columns`#

`.shape`#

`.info()`#

Summary statistics#

`.describe()`#

`.mean()` and `.std()`#

`.groupby()` and `.count()`#

Subset rows or columns#

Row subset: slice just the first row#

Row subset: slice first three rows#

Row subset: slice rows 10 thru 14#

Column subset: one column#

Column subset: multiple columns#

Row and column subset#

Subset by logical condition(s)#

logical AND (`&`)#

logical OR (`|`)#

Exercises#

Data visualization#

Text Analysis and Machine Learning (TAML) Group

Chapter 4 - Numeric data wrangling

Contents

Chapter 4 - Numeric data wrangling#

The pandas library#

Pandas methods#

First, look at the data#

.head()#

.columns#

.shape#

.info()#

Summary statistics#

.describe()#

.mean() and .std()#

.groupby() and .count()#

Subset rows or columns#

Row subset: slice just the first row#

Row subset: slice first three rows#

Row subset: slice rows 10 thru 14#

Column subset: one column#

Column subset: multiple columns#

Row and column subset#

Subset by logical condition(s)#

logical AND (&)#

logical OR (|)#

Exercises#

Data visualization#

`.head()`#

`.columns`#

`.shape`#

`.info()`#

`.describe()`#

`.mean()` and `.std()`#

`.groupby()` and `.count()`#

logical AND (`&`)#

logical OR (`|`)#