v. Visualization essentials

# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# make sure plots show in the notebook
%matplotlib inline

After importing data, you should examine it closely.

  1. Look at the raw data and perform rough checks of your assumptions

  2. Compute summary statistics

  3. Produce visualizations to illustrate obvious - or not so obvious - trends in the data

Plotting with seaborn

First, a note about matplotlib

There are many ways to visualize data in Python, and many of them, including seaborn, are built on matplotlib. Take some time to read through the tutorial: https://matplotlib.org/stable/tutorials/introductory/pyplot.html.

Because many other libraries depend on matplotlib under the hood, you should familiarize yourself with the basics. For example:

import matplotlib.pyplot as plt
x = [1,2,3,4,5]
y = [2,4,6,8,20]
plt.scatter(x, y)
plt.title('title')
plt.ylabel('some numbers')
plt.xlabel('x-axis label')
plt.show()
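
seaborn's axes-level functions return a matplotlib Axes object, so it helps to know matplotlib's object-oriented interface as well as the `plt.` functions. Below is a minimal sketch with the same made-up numbers; the names `fig` and `ax` are conventional, not required.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 20]

# create a figure and a single set of axes explicitly
fig, ax = plt.subplots(figsize = (6, 4))

# the same scatterplot, drawn through the Axes object
ax.scatter(x, y)
ax.set_title("title")
ax.set_xlabel("x-axis label")
ax.set_ylabel("some numbers")
```

Methods such as `set_xlabel` and `set_ylabel` can be called in the same way on the object returned by a seaborn plotting function.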

Visualization best practices

Consult Wilke’s Fundamentals of Data Visualization https://clauswilke.com/dataviz/ for discussions of theory and best practices.

The goal of data visualization is to communicate something about the data accurately. This could be amounts, distributions, relationships, predictions, or the results of sorting the data.

Use the characteristics of different data types to choose the aesthetics of a plot: its axes and coordinate system, color scales and gradients, and formatting and arrangement. Done well, this will impress your audience!

Summary statistics - pandas review

# load the Gapminder dataset
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
# view column names of Gapminder data
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

All columns

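# mean of all numeric variables, by continent
# (numeric_only = True skips the text column 'country'; pandas >= 2.0 raises an error without it)
gap.groupby('continent').mean(numeric_only = True)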
             year           pop    lifeExp     gdpPercap
continent
Africa     1979.5  9.916003e+06  48.865330   2193.754578
Americas   1979.5  2.450479e+07  64.658737   7136.110356
Asia       1979.5  7.703872e+07  60.064903   7902.150428
Europe     1979.5  1.716976e+07  71.903686  14469.475533
Oceania    1979.5  8.874672e+06  74.326208  18621.609223

One column

# Mean life expectancy for each continent
gap.groupby('continent')["lifeExp"].mean()
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

Multiple columns

# Mean lifeExp and gdpPercap for each continent
le_table = gap.groupby('continent')[["lifeExp", "gdpPercap"]].mean()
le_table
             lifeExp     gdpPercap
continent
Africa     48.865330   2193.754578
Americas   64.658737   7136.110356
Asia       60.064903   7902.150428
Europe     71.903686  14469.475533
Oceania    74.326208  18621.609223
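
If you need more than the mean, `agg` accepts a list of statistics and computes them per group in one call. The sketch below uses a small made-up table (`toy` and its values are hypothetical) so that it runs without the Gapminder file; the same pattern should apply to `gap`.

```python
import pandas as pd

# hypothetical data that reuses the Gapminder column names
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe", "Europe"],
    "lifeExp":   [50.0, 60.0, 70.0, 80.0],
    "gdpPercap": [1000.0, 3000.0, 20000.0, 30000.0],
})

# several statistics for one column, per continent
le_stats = toy.groupby("continent")["lifeExp"].agg(["mean", "median", "min", "max"])

# different statistics for different columns, with custom output names
mixed = toy.groupby("continent").agg(
    mean_lifeExp = ("lifeExp", "mean"),
    max_gdp = ("gdpPercap", "max"),
)
```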

Basic plots

  1. Histogram: visualize the distribution of one continuous (numeric, i.e., integer or float) variable.

  2. Boxplot: visualize the distribution of one continuous variable through its median, quartiles, and outliers.

  3. Scatterplot: visualize the relationship between two continuous variables.

Histogram

Use a histogram to plot the distribution of one continuous (i.e., integer or float) variable.

# all data
sns.histplot(data = gap,
            x = 'lifeExp'); 
# by continent
sns.histplot(data = gap, 
            x = 'lifeExp', 
            hue = 'continent');

Boxplot

Boxplots also visualize the distribution of one variable. They draw several of the values in the table of summary statistics: the box spans the 25% and 75% quartiles, and the line inside it marks the median (50%).

# summary statistics
gap.describe()
             year           pop      lifeExp      gdpPercap
count  1704.00000  1.704000e+03  1704.000000    1704.000000
mean   1979.50000  2.960121e+07    59.474439    7215.327081
std      17.26533  1.061579e+08    12.917107    9857.454543
min    1952.00000  6.001100e+04    23.599000     241.165876
25%    1965.75000  2.793664e+06    48.198000    1202.060309
50%    1979.50000  7.023596e+06    60.712500    3531.846988
75%    1993.25000  1.958522e+07    70.845500    9325.462346
max    2007.00000  1.318683e+09    82.603000  113523.132900
# all data
sns.boxplot(data = gap,
            y = 'lifeExp', 
            color = 'gray');
gap.groupby('continent').count()
           country  year  pop  lifeExp  gdpPercap
continent
Africa         624   624  624      624        624
Americas       300   300  300      300        300
Asia           396   396  396      396        396
Europe         360   360  360      360        360
Oceania         24    24   24       24         24
# by continent
sns.boxplot(data = gap,
            x = 'continent', 
            y = 'lifeExp').set_title('Boxplots');
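# custom colors, one per continent
# note: seaborn >= 0.13 warns when palette is used without hue;
# there, add hue = 'continent' and legend = False to keep the same plot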
sns.boxplot(data = gap, 
            x = 'continent', 
            y = 'lifeExp', 
            palette = ['gray', '#8C1515', '#D2C295', '#00505C', 'white']).set_title('Boxplots');
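
To see how a boxplot relates to the summary table, you can compute its elements by hand. By default the box runs from the 25th to the 75th percentile, with a line at the median. The whiskers reach the most extreme points within 1.5 times the interquartile range (IQR) of the box, and anything beyond them is drawn as an outlier. Below is a sketch with made-up numbers (the Series `values` is hypothetical).

```python
import pandas as pd

# made-up values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# edges of the box and the line inside it
q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)
iqr = q3 - q1

# limits beyond which points count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# points that would be drawn individually
outliers = values[(values < lower_fence) | (values > upper_fence)]
```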

Scatterplot

Scatterplots are useful to illustrate the relationship between two continuous variables. Below are several options for you to try.

### change figure size
sns.set(rc = {'figure.figsize':(12,8)})

### change background
sns.set_style("ticks")

# commented code
ex1 = sns.scatterplot(
    
    # dataset
    data = gap,
    
    # x-axis variable to plot
    x = 'lifeExp', 
    
    # y-axis variable to plot
    y = 'gdpPercap', 
    
    # color points by categorical variable
    hue = 'continent', 
    
    # point transparency (0 = fully transparent, 1 = fully opaque)
    alpha = 1)

### log scale y-axis
ex1.set(yscale="log")

### set axis labels
ex1.set_xlabel("Life expectancy (Years)", fontsize = 20)
ex1.set_ylabel("GDP per cap (US$)", fontsize = 20);

### uncomment the line below to save the figure
### NOTE: this might only work on a local Python installation and not in JupyterLab; try it!

# plt.savefig('img/scatter_gap.pdf')
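
`plt.savefig` raises an error if the target folder does not exist, which is one common reason saving fails. Below is a minimal sketch that creates the folder first and saves through the figure object. The folder name `img` comes from the example above, while the file name `scatter_example.png` and the toy data are made up for illustration.

```python
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up data so the example runs on its own
toy = pd.DataFrame({"lifeExp": [45, 60, 75], "gdpPercap": [500, 5000, 40000]})

# draw the scatterplot on an explicit figure and axes
fig, ax = plt.subplots(figsize = (6, 4))
sns.scatterplot(data = toy, x = "lifeExp", y = "gdpPercap", ax = ax)
ax.set(yscale = "log")

# create the output folder if it is missing, then save
out_dir = Path("img")
out_dir.mkdir(exist_ok = True)
out_file = out_dir / "scatter_example.png"
fig.savefig(out_file, dpi = 300, bbox_inches = "tight")
```

The file extension sets the output format, so changing `.png` to `.pdf` should give a vector file instead.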

Quiz - Penguins dataset

Learn more about the biological and spatial characteristics of penguins!

  1. Use seaborn to make one of each of the plots in the image below. Check out the seaborn tutorial for more examples and formatting options: https://seaborn.pydata.org/tutorial/function_overview.html

  2. What might you conclude about the species of penguins from this dataset?

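[Figure: overview of seaborn plotting functions, grouped as relational, distributions, and categorical]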

Map of Antarctica

Below is a map of the part of Antarctica that lies past the southernmost tip of South America.

The distance from the Biscoe Islands (Renaud) to the Torgersen and Dream Islands is about 140 km.

[Figure: map of Antarctica]

# get help with the question mark
# sns.scatterplot?
# load penguins data
penguins = pd.read_csv('data/penguins.csv')
# hint: 
penguins.groupby('island').count()
           species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
island
Biscoe         168             167            167                167          167  163
Dream          124             124            124                124          124  123
Torgersen       52              51             51                 51           51   47
# hint:
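# (numeric_only = True skips the text columns 'species' and 'sex'; pandas >= 2.0 raises an error without it)
penguins.groupby('island').mean(numeric_only = True)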
           bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
island
Biscoe          45.257485      15.874850         209.706587  4716.017964
Dream           44.167742      18.344355         193.072581  3712.903226
Torgersen       38.950980      18.429412         191.196078  3706.372549
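
The counts in the first hint differ between columns (for example, 168 species rows but 167 bill lengths on Biscoe), which means some values are missing. `count()` skips missing values, so comparing it with `size()` is one quick way to find them. Below is a sketch on a made-up table (`toy` is hypothetical, with column names borrowed from the penguins data).

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing measurement
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
    "body_mass_g": [4500.0, np.nan, 3700.0, 3800.0],
})

# rows per island, including rows with missing values
rows = toy.groupby("island").size()

# non-missing values per island and column
non_missing = toy.groupby("island").count()

# total missing values per column
n_missing = toy.isna().sum()

# keep only complete rows, if that suits your analysis
complete = toy.dropna()
```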
# 1. relational - scatterplot
# your answer here:
# 2. relational - lineplot
# your answer here:
# 3. distributions - histplot
# your answer here:
# 4. distributions - kdeplot
# your answer here:
# 5. distributions - ecdfplot
# your answer here:
# 6. distributions - rugplot
# your answer here:
# 7. categorical - stripplot
# your answer here:
# 8. categorical - swarmplot
# your answer here:
# 9. categorical - boxplot
# your answer here:
# 10. categorical - violinplot
# your answer here:
# 11. categorical - pointplot
# your answer here:
# 12. categorical - barplot
# your answer here:

Quiz - Gapminder dataset

Make the twelve plots using the Gapminder dataset.

What can you conclude about income and life expectancy?

Visit https://www.gapminder.org/ to learn more!

Things you are probably wrong about!

See the survey and correct response rate of the Sustainable Development Misconception Study 2020

# 1. relational - scatterplot
# your answer here:
# 2. relational - lineplot
# your answer here:
# 3. distributions - histplot
# your answer here:
# 4. distributions - kdeplot
# your answer here:
# 5. distributions - ecdfplot
# your answer here:
# 6. distributions - rugplot
# your answer here:
# 7. categorical - stripplot
# your answer here:
# 8. categorical - swarmplot
# your answer here:
# 9. categorical - boxplot
# your answer here:
# 10. categorical - violinplot
# your answer here:
# 11. categorical - pointplot
# your answer here:
# 12. categorical - barplot
# your answer here: