iv. Numeric data wrangling#

For text preprocessing, see the January 19 notebook.

wrangle

The pandas library#

Import the pandas library with the alias pd.

Use “dot notation” to apply its methods to the dataset, which is stored in a tabular .csv file.

To import the .csv file, use the .read_csv() pandas method. The only argument is the file path to a .csv file.

Import the data below and investigate it by applying pandas methods to the data frame gap.

import pandas as pd

gap = pd.read_csv("data/gapminder-FiveYearData.csv")

print(type(gap))

<class 'pandas.core.frame.DataFrame'>

Pandas methods#

Just a small handful of pandas methods will help you accomplish several key data wrangling tasks. See the examplea code below.

First, look at the data
Summary statistics
Subset rows:
- first row
- first three rows
- rows 10 thru 14
Subset columns:
- one column
- multiple columns
Row and column subset
Subset by logical condition(s)

First, look at the data#

Use pandas methods to learn about the data!

`.head()`#

# look at first five rows by default
gap.head()

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

`.columns`#

# View column names
# this is an attribute instead of a method - note the lack of parentheses ()
gap.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

`.shape`#

# Show number of rows by columns
# this is an attribute instead of a method - note the lack of parentheses ()
gap.shape

(1704, 6)

Summary statistics#

Produce summary statistics, including:

Count, mean, sd, quartiles, min/max
Tabulate frequencies

`.describe()`#

# produce summary statistics for numeric data
gap.describe()

	year	pop	lifeExp	gdpPercap
count	1704.00000	1.704000e+03	1704.000000	1704.000000
mean	1979.50000	2.960121e+07	59.474439	7215.327081
std	17.26533	1.061579e+08	12.917107	9857.454543
min	1952.00000	6.001100e+04	23.599000	241.165876
25%	1965.75000	2.793664e+06	48.198000	1202.060309
50%	1979.50000	7.023596e+06	60.712500	3531.846988
75%	1993.25000	1.958522e+07	70.845500	9325.462346
max	2007.00000	1.318683e+09	82.603000	113523.132900

`.mean()` and `.std()`#

# calculate mean and standard deviation of lifeExp
lifeExp_mean = gap["lifeExp"].mean()
lifeExp_sd = gap["lifeExp"].std()
print(lifeExp_mean)
print(lifeExp_sd)

59.474439366197174
12.917107415241192

`.groupby()` and `.count()`#

# count the number of countries grouped by each continent
gap.groupby("continent").count()["country"]

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
Name: country, dtype: int64

Subset rows or columns#

Sampling data is necessary for many reasons, including quick sanity checks.

Slice a data frame by using bracket notation to specify start and end points [start : end]

The [start index is included and the end] index is excluded.

Remember that Python is a zero-indexed language, so starts counting from zero, not one.

Leave the start or end values blank to start from the beginning, or go to the end of a collection.

Row subset: slice just the first row#

gap[:1]

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314

Row subset: slice first three rows#

gap[:3]

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710

Row subset: slice rows 10 thru 14#

subset1 = gap[10:15]
subset1

	country	year	pop	continent	lifeExp	gdpPercap
10	Afghanistan	2002	25268405.0	Asia	42.129	726.734055
11	Afghanistan	2007	31889923.0	Asia	43.828	974.580338
12	Albania	1952	1282697.0	Europe	55.230	1601.056136
13	Albania	1957	1476505.0	Europe	59.280	1942.284244
14	Albania	1962	1728137.0	Europe	64.820	2312.888958

Column subset: one column#

# type the column name as a string in square brackets
gap['lifeExp']

     28.801
     30.332
     31.997
     34.020
     36.088
         ...  
  62.351
  60.377
  46.809
  39.989
  43.487
Name: lifeExp, Length: 1704, dtype: float64

Column subset: multiple columns#

# note the double sets of brackets
subset2 = gap[['continent', 'lifeExp', 'gdpPercap']]
subset2

	continent	lifeExp	gdpPercap
0	Asia	28.801	779.445314
1	Asia	30.332	820.853030
2	Asia	31.997	853.100710
3	Asia	34.020	836.197138
4	Asia	36.088	739.981106
...	...	...	...
1699	Africa	62.351	706.157306
1700	Africa	60.377	693.420786
1701	Africa	46.809	792.449960
1702	Africa	39.989	672.038623
1703	Africa	43.487	469.709298

1704 rows × 3 columns

Row and column subset#

# subset more than one column and rows 855 thru 858
subset3 = gap[['continent', 'lifeExp', 'gdpPercap']][855:859]
subset3

	continent	lifeExp	gdpPercap
855	Asia	64.624	80894.88326
856	Asia	67.712	109347.86700
857	Asia	69.343	59265.47714
858	Asia	71.309	31354.03573

type(gap["lifeExp"])

pandas.core.series.Series

Subset by logical condition(s)#

# lifeExp is greater than 80
le2 = gap[gap['lifeExp'] > 81]
le2

	country	year	pop	continent	lifeExp	gdpPercap
71	Australia	2007	20434176.0	Oceania	81.235	34435.36744
670	Hong Kong China	2002	6762476.0	Asia	81.495	30209.01516
671	Hong Kong China	2007	6980412.0	Asia	82.208	39724.97867
695	Iceland	2007	301931.0	Europe	81.757	36180.78919
802	Japan	2002	127065841.0	Asia	82.000	28604.59190
803	Japan	2007	127467972.0	Asia	82.603	31656.06806
1487	Switzerland	2007	7554661.0	Europe	81.701	37506.41907

# logical AND (all conditions must be satisfied to be included)

# create subset that includes life expectancy greater than 81 AND pop < 500,000.

year2002 = gap[(gap["lifeExp"] > 81) & (gap["pop"] < 500000)]
year2002

	country	year	pop	continent	lifeExp	gdpPercap
695	Iceland	2007	301931.0	Europe	81.757	36180.78919

# logical OR (one of multiple conditions must be satisfied to be included)

# create subset that includes country equals Ireland OR life expectancy greater than 82. 

ireland82 = gap[(gap["country"] == "Ireland") | (gap["lifeExp"] > 82)]
ireland82

	country	year	pop	continent	lifeExp	gdpPercap
671	Hong Kong China	2007	6980412.0	Asia	82.208	39724.978670
744	Ireland	1952	2952156.0	Europe	66.910	5210.280328
745	Ireland	1957	2878220.0	Europe	68.900	5599.077872
746	Ireland	1962	2830000.0	Europe	70.290	6631.597314
747	Ireland	1967	2900100.0	Europe	71.080	7655.568963
748	Ireland	1972	3024400.0	Europe	71.280	9530.772896
749	Ireland	1977	3271900.0	Europe	72.030	11150.981130
750	Ireland	1982	3480000.0	Europe	73.100	12618.321410
751	Ireland	1987	3539900.0	Europe	74.360	13872.866520
752	Ireland	1992	3557761.0	Europe	75.467	17558.815550
753	Ireland	1997	3667233.0	Europe	76.122	24521.947130
754	Ireland	2002	3879155.0	Europe	77.783	34077.049390
755	Ireland	2007	4109086.0	Europe	78.885	40675.996350
803	Japan	2007	127467972.0	Asia	82.603	31656.068060

Text Analysis and Machine Learning (TAML) Group

iv. Numeric data wrangling

Contents

iv. Numeric data wrangling#

The pandas library#

Pandas methods#

First, look at the data#

`.head()`#

`.columns`#

`.shape`#

Summary statistics#

`.describe()`#

`.mean()` and `.std()`#

`.groupby()` and `.count()`#

Subset rows or columns#

Row subset: slice just the first row#

Row subset: slice first three rows#

Row subset: slice rows 10 thru 14#

Column subset: one column#

Column subset: multiple columns#

Row and column subset#

Subset by logical condition(s)#

Text Analysis and Machine Learning (TAML) Group

iv. Numeric data wrangling

Contents

iv. Numeric data wrangling#

The pandas library#

Pandas methods#

First, look at the data#

.head()#

.columns#

.shape#

Summary statistics#

.describe()#

.mean() and .std()#

.groupby() and .count()#

Subset rows or columns#

Row subset: slice just the first row#

Row subset: slice first three rows#

Row subset: slice rows 10 thru 14#

Column subset: one column#

Column subset: multiple columns#

Row and column subset#

Subset by logical condition(s)#

`.head()`#

`.columns`#

`.shape`#

`.describe()`#

`.mean()` and `.std()`#

`.groupby()` and `.count()`#