Solutions#
Example solutions for challenge exercises from each chapter in this book.
Chapter 1 - Exercises#
You will find challenge exercises to work on at the end of each chapter. They will require you to write code such as that found in the cell at the top of this notebook.
Click the “Colab” badge at the top of this notebook to open it in the Colaboratory environment. Press shift and enter simultaneously on your keyboard to run the code and draw your lucky card!
Remember: Press shift and enter on your keyboard to run a cell.
# import the necessary libraries to make the code work
import random
import calendar
from datetime import date, datetime
# define the deck and suits as character strings and split them on the spaces
deck = 'ace two three four five six seven eight nine ten jack queen king'.split()
suit = 'spades clubs hearts diamonds'.split()
print(deck)
print(suit)
['ace', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'jack', 'queen', 'king']
['spades', 'clubs', 'hearts', 'diamonds']
# define today's day and date
today = calendar.day_name[date.today().weekday()]
date = datetime.today().strftime('%Y-%m-%d')
print(today)
print(date)
Saturday
2022-10-29
# randomly sample the card value and suit
select_value = random.sample(deck, 1)[0]
select_suit = random.sample(suit, 1)[0]
print(select_value)
print(select_suit)
jack
diamonds
# combine the character strings and variables into the final statement
print("\nWelcome to TAML at SSDS!")
print("\nYour lucky card for " + today + " " + date + " is: " + select_value + " of " + select_suit)
Welcome to TAML at SSDS!
Your lucky card for Saturday 2022-10-29 is: jack of diamonds
Chapter 2 - Exercises#
(Required) Set up your Google Colaboratory (Colab) environment following the instructions in #1 listed above.
(Optional) Check that you can correctly open these notebooks in Jupyter Lab.
(Optional) Install Python Anaconda distribution on your machine.
See 2_Python_environments.ipynb for instructions.
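These setup steps do not require writing new code, but a quick way to confirm that your environment (Colab, Jupyter Lab, or Anaconda) is working is to print the Python version and the versions of a few packages used throughout this book:
import sys
import pandas as pd
import seaborn as sns
import sklearn
# confirm that the interpreter and key libraries are available
print(sys.version)
print("pandas:", pd.__version__)
print("seaborn:", sns.__version__)
print("scikit-learn:", sklearn.__version__)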
Chapter 3 - Exercises#
Define one variable for each of the four data types introduced above: 1) string, 2) boolean, 3) float, and 4) integer.
Define two lists that contain four elements each.
Define a dictionary that contains the two lists from #2 above.
Import the file “dracula.txt”. Save it in a variable named drac
Import the file “penguins.csv”. Save it in a variable named pen
Figure out how to find help to export just the first 1000 characters of drac as a .txt file named “dracula_short.txt”
Figure out how to export the pen dataframe as a file named “penguins_saved.csv”
If you encounter error messages, which ones?
#1
string1 = "Hello!"
string2 = "This is a sentence."
print(string1)
print(string2)
Hello!
This is a sentence.
bool1 = True
bool2 = False
print(bool1)
print(bool2)
True
False
float1 = 3.14
float2 = 12.345
print(float1)
print(float2)
3.14
12.345
integer1 = 8
integer2 = 4356
print(integer1)
print(integer2)
8
4356
#2
list1 = [integer2, string2, float1, "My name is:"]
list2 = [3, True, "What?", string1]
print(list1)
print(list2)
[4356, 'This is a sentence.', 3.14, 'My name is:']
[3, True, 'What?', 'Hello!']
#3
dict_one = {"direction": "up",
"code": 1234,
"first_list": list1,
"second_list": list2}
dict_one
{'direction': 'up',
'code': 1234,
'first_list': [4356, 'This is a sentence.', 3.14, 'My name is:'],
'second_list': [3, True, 'What?', 'Hello!']}
#4
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/dracula.txt
drac = open("data/dracula.txt").read()
# print(drac)
#5
import pandas as pd
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/penguins.csv
pen = pd.read_csv("data/penguins.csv")
pen
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
344 rows × 7 columns
#6
# first slice the string you want to save
drac_short = drac[:1000]
# second, open in write mode and write the file to the data directory!
with open('data/dracula_short.txt', 'w', encoding='utf-8') as f:
f.write(drac_short)
# You can also copy files from Colab to your Google Drive
# Mount your GDrive
# from google.colab import drive
# drive.mount('/content/drive')
# Copy a file from Colab to GDrive
# !cp data/dracula_short.txt /content/drive/MyDrive
#7
pen.to_csv("data/penguins_saved.csv")
# !cp data/penguins_saved.csv /content/drive/MyDrive
Chapter 4 - Exercises#
Load the file “gapminder-FiveYearData.csv” and save it in a variable named gap
Print the column names
Compute the mean for one numeric column
Compute the mean for all numeric columns
Tabulate frequencies for the “continent” column
Compute mean lifeExp and gdpPercap by continent
Create a subset of gap that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.
#1
import pandas as pd
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
gap
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 9216418.0 | Africa | 62.351 | 706.157306 |
1700 | Zimbabwe | 1992 | 10704340.0 | Africa | 60.377 | 693.420786 |
1701 | Zimbabwe | 1997 | 11404948.0 | Africa | 46.809 | 792.449960 |
1702 | Zimbabwe | 2002 | 11926563.0 | Africa | 39.989 | 672.038623 |
1703 | Zimbabwe | 2007 | 12311143.0 | Africa | 43.487 | 469.709298 |
1704 rows × 6 columns
#2
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
#3
gap["lifeExp"].mean()
59.474439366197174
# or
gap.describe()
year | pop | lifeExp | gdpPercap | |
---|---|---|---|---|
count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
#4
# select only the numeric columns before computing the means
print(gap.mean(numeric_only = True))
year 1.979500e+03
pop 2.960121e+07
lifeExp 5.947444e+01
gdpPercap 7.215327e+03
dtype: float64
#5
gap["continent"].value_counts()
Africa 624
Asia 396
Europe 360
Americas 300
Oceania 24
Name: continent, dtype: int64
#6
le_gdp_by_continent = gap.groupby("continent").agg(mean_le = ("lifeExp", "mean"),
mean_gdp = ("gdpPercap", "mean"))
le_gdp_by_continent
mean_le | mean_gdp | |
---|---|---|
continent | ||
Africa | 48.865330 | 2193.754578 |
Americas | 64.658737 | 7136.110356 |
Asia | 60.064903 | 7902.150428 |
Europe | 71.903686 | 14469.475533 |
Oceania | 74.326208 | 18621.609223 |
#7
gap_75_1000 = gap[(gap["lifeExp"] > 75) & (gap["gdpPercap"] < 5000)]
gap_75_1000
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
22 | Albania | 2002 | 3508512.0 | Europe | 75.651 | 4604.211737 |
Chapter 5 - Penguins Exercises#
Learn more about the biological and spatial characteristics of penguins!
Use seaborn to make a scatterplot of two continuous variables. Color each point by species.
Make the same scatterplot as #1 above. This time, color each point by sex.
Make the same scatterplot as #1 above again. This time, color each point by island.
Use the sns.FacetGrid method to make faceted plots to examine “flipper_length_mm” on the x-axis, and “body_mass_g” on the y-axis.
import pandas as pd
import seaborn as sns
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/penguins.csv
peng = pd.read_csv("data/penguins.csv")
# set seaborn figure size, background theme, and axis and tick label size
sns.set(rc={'figure.figsize':(10, 7)})
sns.set(font_scale = 2)
sns.set_theme(style='ticks')
peng
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
344 rows × 7 columns
#1
sns.scatterplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
hue = "species",
s = 250, alpha = 0.75);
#2
sns.scatterplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
hue = "sex",
s = 250, alpha = 0.75,
palette = ["red", "green"]).legend(title = "Species",
fontsize = 20,
title_fontsize = 30,
loc = "best");
#3
sns.scatterplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
hue = "island").legend(loc = "lower right");
#4
facet_plot = sns.FacetGrid(data = peng, col = "island", row = "sex")
facet_plot.map(sns.scatterplot, "flipper_length_mm", "body_mass_g");
Chapter 5 - Gapminder Exercises#
Figure out how to make a line plot that shows gdpPercap through time.
Figure out how to make a second line plot that shows lifeExp through time.
How can you plot gdpPercap with a different colored line for each continent?
Plot lifeExp with a different colored line for each continent.
import pandas as pd
import seaborn as sns
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
#1
sns.lineplot(data = gap, x = "year", y = "gdpPercap", ci = 95);
#2
sns.lineplot(data = gap, x = "year", y = "lifeExp", ci = False);
#3
sns.lineplot(data = gap, x = "year", y = "gdpPercap", hue = "continent", ci = False);
#4
sns.lineplot(data = gap, x = "year", y = "lifeExp",
hue = "continent", ci = False);
#4 with custom colors
sns.lineplot(data = gap, x = "year", y = "lifeExp",
hue = "continent",
ci = False,
palette = ["#00FFFF", "#458B74", "#E3CF57", "#8A2BE2", "#CD3333"]);
# color hex codes: https://www.webucator.com/article/python-color-constants-module/
# seaborn color palettes: https://www.reddit.com/r/visualization/comments/qc0b36/all_seaborn_color_palettes_together_so_you_dont/
Exercise - scikit-learn’s LinearRegression() function#
Compare our “by hand” OLS results to those produced by sklearn’s LinearRegression function. Are they the same?
Slope = 4
Intercept = -4
RMSE = 2.82843
y_hat = B0 + B1 * data.x
#1
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Recreate dataset
import pandas as pd
data = pd.DataFrame({"x": [1,2,3,4,5],
"y": [2,4,6,8,20]})
data
x | y | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 4 |
2 | 3 | 6 |
3 | 4 | 8 |
4 | 5 | 20 |
# Our "by hand" OLS regression information:
B1 = 4
B0 = -4
RMSE = 2.82843
y_hat = B0 + B1 * data.x
# use scikit-learn to compute R-squared value
lin_mod = LinearRegression().fit(data[['x']], data[['y']])
print("R-squared: " + str(lin_mod.score(data[['x']], data[['y']])))
R-squared: 0.8
# use scikit-learn to compute slope and intercept
print("scikit-learn slope: " + str(lin_mod.coef_))
print("scikit-learn intercept: " + str(lin_mod.intercept_))
scikit-learn slope: [[4.]]
scikit-learn intercept: [-4.]
# compare to our by "hand" versions. Both are the same!
print(int(lin_mod.coef_) == B1)
print(int(lin_mod.intercept_) == B0)
True
True
# use scikit-learn to compute RMSE
RMSE_scikit = round(mean_squared_error(data.y, y_hat, squared = False), 5)
print(RMSE_scikit)
2.82843
# Does our hand-computed RMSE equal that of scikit-learn at 5 digits? Yes!
print(round(RMSE, 5) == round(RMSE_scikit, 5))
True
Chapter 7 - Exercises - redwoods webscraping#
This also works with data scraped from the web. Below is a very brief BeautifulSoup example that saves the contents of the Sequoioideae (redwood trees) Wikipedia page in a variable named text.
Read through the code below
Practice by repeating for a webpage of your choice
#1
# See 7_English_preprocessing_basics.ipynb
#2
from bs4 import BeautifulSoup
import requests
import regex as re
import nltk
url = "https://en.wikipedia.org/wiki/Observable_universe"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html')
text = ""
for paragraph in soup.find_all('p'):
text += paragraph.text
text = re.sub(r'\[[0-9]*\]',' ',text)
text = re.sub(r'\s+',' ',text)
text = re.sub(r'\d',' ',text)
text = re.sub(r'[^\w\s]','',text)
text = text.lower()
text = re.sub(r'\s+',' ',text)
# print(text)
Chapter 7 - Exercise - Dracula versus Frankenstein#
Practice your text pre-processing skills on the classic novel Dracula! Here you’ll just be performing the standardization operations on a text string instead of a DataFrame, so be sure to adapt the practices you saw with the UN HRC corpus processing appropriately.
Can you:
Remove non-alphanumeric characters & punctuation?
Remove digits?
Remove unicode characters?
Remove extraneous spaces?
Standardize casing?
Lemmatize tokens?
Investigate classic horror novel vocabulary. Create a single TF-IDF sparse matrix that contains the vocabulary for Frankenstein and Dracula. You should only have two rows (one for each of these novels), but potentially thousands of columns to represent the vocabulary across the two texts. What are the 20 most unique words in each? Make a dataframe or visualization to illustrate the differences.
Read through this 20 newsgroups dataset example to get familiar with newsgroup data. Do your best to understand and explain what is happening at each step of the workflow. “The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.”
# 1
import regex as re
from string import punctuation
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import pandas as pd
from collections import Counter
import seaborn as sns
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/evanmuzzall/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Import dracula.txt#
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/dracula.txt
text = open("data/dracula.txt").read()
# print just the first 100 characters
print(text[:100])
The Project Gutenberg eBook of Dracula, by Bram Stoker
This eBook is for the use of anyone anywhere
Standardize Text#
Casing and spacing#
Oftentimes in text analysis, identifying occurrences of key words is a necessary step. To do so, we may want “apple,” “ApPLe,” and “apple ” to be treated the same; i.e., as occurrences of the token “apple.” To achieve this, we can standardize text casing and spacing:
# Converting all characters in a string to lowercase only requires one method:
message = "Hello! Welcome to TAML!"
print(message.lower())
# To replace instances of multiple spaces with one, we can use the regex module's 'sub' function:
# Documentation on regex can be found at: https://docs.python.org/3/library/re.html
single_spaces_msg = re.sub(r'\s+', ' ', message)
print(single_spaces_msg)
hello! welcome to taml!
Hello! Welcome to TAML!
Remove punctuation#
Remember that Python methods can be chained together.
Below, a standard for loop iterates through the characters in the string module’s punctuation constant and replaces each of them with nothing.
print(punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
for char in punctuation:
text = text.lower().replace(char, "")
print(text[:100])
the project gutenberg ebook of dracula by bram stoker
this ebook is for the use of anyone anywhere
Tokenize the text#
Split the text into tokens on the spaces between words.
# .split() returns a list of the tokens in a string, separated by the specified delimiter (default: " ")
tokens = text.split()
# View the first 20
print(tokens[:20])
['the', 'project', 'gutenberg', 'ebook', 'of', 'dracula', 'by', 'bram', 'stoker', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the']
Remove stop words#
Below is a list comprehension (a compact shorthand for a for loop) that can accomplish this task for us.
filtered_text = [word for word in tokens if word not in stopwords.words('english')]
# show only the first 100 words
# do you see any stopwords?
print(filtered_text[:100])
['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook', 'use', 'anyone', 'anywhere', 'united', 'states', 'parts', 'world', 'cost', 'almost', 'restrictions', 'whatsoever', 'may', 'copy', 'give', 'away', 'reuse', 'terms', 'project', 'gutenberg', 'license', 'included', 'ebook', 'online', 'wwwgutenbergorg', 'located', 'united', 'states', 'check', 'laws', 'country', 'located', 'using', 'ebook', 'title', 'dracula', 'author', 'bram', 'stoker', 'release', 'date', 'october', '1995', 'ebook', '345', 'recently', 'updated', 'september', '5', '2022', 'language', 'english', 'produced', 'chuck', 'greif', 'online', 'distributed', 'proofreading', 'team', 'start', 'project', 'gutenberg', 'ebook', 'dracula', 'dracula', 'bram', 'stoker', 'illustration', 'colophon', 'new', 'york', 'grosset', 'dunlap', 'publishers', 'copyright', '1897', 'united', 'states', 'america', 'according', 'act', 'congress', 'bram', 'stoker', 'rights', 'reserved', 'printed', 'united', 'states', 'country', 'life', 'press', 'garden', 'city']
Lemmatizing/Stemming tokens#
Lemmatizing and stemming are related, but different, practices. Both aim to reduce the inflectional forms of a token to a common base/root; however, how they go about doing so is the key differentiating factor.
Stemming operates by removing the prefixes and/or suffixes of a word. Examples include:
flooding to flood
studies to studi
risky to risk
Lemmatization attempts to contextualize a word, arriving at its base meaning. Lemmatization reductions can occur across various parts of speech. Examples include:
Plural to singular (corpora to corpus)
Condition (better to good)
Gerund (running to run)
One technique is not strictly better than the other - it’s a matter of project needs and proper application.
stmer = nltk.PorterStemmer()
lmtzr = nltk.WordNetLemmatizer()
# do you see any differences?
token_stem = [ stmer.stem(token) for token in filtered_text]
token_lemma = [ lmtzr.lemmatize(token) for token in filtered_text ]
print(token_stem[:20])
print(token_lemma[:20])
['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook', 'use', 'anyon', 'anywher', 'unit', 'state', 'part', 'world', 'cost', 'almost', 'restrict', 'whatsoev', 'may', 'copi']
['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook', 'use', 'anyone', 'anywhere', 'united', 'state', 'part', 'world', 'cost', 'almost', 'restriction', 'whatsoever', 'may', 'copy']
Convert to dataframe#
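The variable chunked used in the next cell is not defined in the cells above; it appears to come from a part-of-speech tagging step that was omitted. A minimal sketch of that missing step, assuming NLTK’s pos_tag function (which pairs each token with a part-of-speech tag and may require downloading the averaged_perceptron_tagger resource first):
# assumption: tag each lemmatized token with a part-of-speech label using NLTK
# nltk.download('averaged_perceptron_tagger')  # uncomment if the tagger is not yet installed
chunked = nltk.pos_tag(token_lemma)
print(chunked[:5])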
df = pd.DataFrame(chunked, columns=['word', 'pos'])
df.head(n = 10)
word | pos | |
---|---|---|
0 | project | NN |
1 | gutenberg | NN |
2 | ebook | NN |
3 | dracula | NN |
4 | bram | NN |
5 | stoker | NN |
6 | ebook | NN |
7 | use | NN |
8 | anyone | NN |
9 | anywhere | RB |
df.shape
(73541, 2)
Visualize the 20 most frequent words#
top = df.copy()
count_words = Counter(top['word'])
count_words.most_common()[:20]
[('said', 569),
('one', 509),
('could', 493),
('u', 463),
('must', 451),
('would', 428),
('shall', 427),
('time', 425),
('know', 420),
('may', 416),
('see', 398),
('come', 377),
('van', 322),
('hand', 310),
('came', 307),
('helsing', 300),
('went', 298),
('lucy', 296),
('go', 296),
('like', 278)]
words_df = pd.DataFrame(count_words.items(), columns=['word', 'count']).sort_values(by = 'count', ascending=False)
words_df[:20]
word | count | |
---|---|---|
205 | said | 569 |
252 | one | 509 |
151 | could | 493 |
176 | u | 463 |
315 | must | 451 |
158 | would | 428 |
274 | shall | 427 |
161 | time | 425 |
220 | know | 420 |
17 | may | 416 |
378 | see | 398 |
680 | come | 377 |
120 | van | 322 |
1165 | hand | 310 |
184 | came | 307 |
121 | helsing | 300 |
542 | went | 298 |
155 | go | 296 |
100 | lucy | 296 |
403 | like | 278 |
# What would you need to do to improve an approach to word visualization such as this one?
top_plot = sns.barplot(x = 'word', y = 'count', data = words_df[:20])
top_plot.set_xticklabels(top_plot.get_xticklabels(),rotation = 40);
#2
# note: spaCy may need to be installed first (e.g., !pip install spacy)
import spacy
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a new directory to house the two novels
!mkdir -p data/novels/  # -p avoids an error if the directory already exists
# Download the two novels
# !wget -P data/novels/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/dracula.txt
# !wget -P data/novels/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/frankenstein.txt
# See that they are there!
!ls data/novels
dracula.txt frankenstein.txt
import os
corpus = os.listdir('data/novels/')
# View the contents of this directory
corpus
['frankenstein.txt', 'dracula.txt']
empty_dictionary = {}
# Loop through the folder of documents to open and read each one
for document in corpus:
with open('data/novels/' + document, 'r', encoding = 'utf-8') as to_open:
empty_dictionary[document] = to_open.read()
# Populate the data frame with two columns: file name and document text
novels = (pd.DataFrame.from_dict(empty_dictionary,
orient = 'index')
.reset_index().rename(index = str,
columns = {'index': 'file_name', 0: 'document_text'}))
novels
file_name | document_text | |
---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... |
novels['clean_text'] = novels['document_text'].str.replace(r'[^\w\s]', ' ', regex = True)
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bra... |
novels['clean_text'] = novels['clean_text'].str.replace(r'\d', ' ', regex = True)
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bra... |
novels['clean_text'] = novels['clean_text'].str.encode('ascii', 'ignore').str.decode('ascii')
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bra... |
novels['clean_text'] = novels['clean_text'].str.replace(r'\s+', ' ', regex = True)
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein by... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bram... |
novels['clean_text'] = novels['clean_text'].str.lower()
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | the project gutenberg ebook of frankenstein by... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | the project gutenberg ebook of dracula by bram... |
# !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
novels['clean_text'] = novels['clean_text'].apply(lambda row: ' '.join([w.lemma_ for w in nlp(row)]))
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | the project gutenberg ebook of frankenstein by... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | the project gutenberg ebook of dracula by bram... |
tf_vectorizer = TfidfVectorizer(ngram_range = (1, 3),
stop_words = 'english',
max_df = 0.50
)
tf_sparse = tf_vectorizer.fit_transform(novels['clean_text'])
tf_sparse.shape
(2, 168048)
tfidf_df = pd.DataFrame(tf_sparse.todense(), columns = tf_vectorizer.get_feature_names_out())
tfidf_df
aback | aback moment | aback moment know | abaft | abaft bi | abaft bi bank | abaft krok | abaft krok hooal | abandon abortion | abandon abortion spurn | ... | zophagous life eat | zophagous patient | zophagous patient effect | zophagous patient outburst | zophagous patient report | zophagous wild | zophagous wild raving | zophagy | zophagy puzzle | zophagy puzzle little | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.002998 | 0.002998 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.001011 | 0.001011 | 0.001011 | 0.002021 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.000000 | 0.000000 | ... | 0.001011 | 0.003032 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 |
2 rows × 168048 columns
tfidf_df.max().sort_values(ascending = False).head(n = 20)
van 0.326435
helsing 0.310265
van helsing 0.310265
lucy 0.304201
elizabeth 0.275854
mina 0.246595
jonathan 0.210212
count 0.202127
dr 0.191010
harker 0.178882
clerval 0.176906
justine 0.164913
felix 0.149920
seward 0.140478
diary 0.120265
dr seward 0.118244
perceive 0.116938
box 0.116223
geneva 0.107943
misfortune 0.098948
dtype: float64
titles = novels['file_name'].str.slice(stop = -4)
titles = list(titles)
titles
['frankenstein', 'dracula']
tfidf_df['TITLE'] = titles
tfidf_df
aback | aback moment | aback moment know | abaft | abaft bi | abaft bi bank | abaft krok | abaft krok hooal | abandon abortion | abandon abortion spurn | ... | zophagous patient | zophagous patient effect | zophagous patient outburst | zophagous patient report | zophagous wild | zophagous wild raving | zophagy | zophagy puzzle | zophagy puzzle little | TITLE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.002998 | 0.002998 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | frankenstein |
1 | 0.001011 | 0.001011 | 0.001011 | 0.002021 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.000000 | 0.000000 | ... | 0.003032 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | dracula |
2 rows × 168049 columns
# frankenstein top 20 words
title = tfidf_df[tfidf_df['TITLE'] == 'frankenstein']
title.max(numeric_only = True).sort_values(ascending = False).head(20)
elizabeth 0.275854
clerval 0.176906
justine 0.164913
felix 0.149920
perceive 0.116938
geneva 0.107943
misfortune 0.098948
frankenstein 0.092951
beheld 0.083955
victor 0.083955
murderer 0.080957
henry 0.077959
cousin 0.077959
safie 0.074960
william 0.074960
cottager 0.071962
chapter chapter 0.068963
chapter chapter chapter 0.065965
creator 0.062967
exclaim 0.059968
dtype: float64
# dracula top 20 words
title = tfidf_df[tfidf_df['TITLE'] == 'dracula']
title.max(numeric_only = True).sort_values(ascending = False).head(20)
van 0.326435
helsing 0.310265
van helsing 0.310265
lucy 0.304201
mina 0.246595
jonathan 0.210212
count 0.202127
dr 0.191010
harker 0.178882
seward 0.140478
diary 0.120265
dr seward 0.118244
box 0.116223
sort 0.096010
madam 0.095000
madam mina 0.087925
don 0.086915
quincey 0.085904
godalming 0.079840
morris 0.078829
dtype: float64
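For the third exercise above, here is a minimal sketch of loading the 20 newsgroups training subset and turning the posts into a TF-IDF matrix with scikit-learn; it is only a starting point for reading through the linked example, not the full workflow it describes.
# 3
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# fetch the training subset (roughly 11,000 posts across 20 topics)
newsgroups_train = fetch_20newsgroups(subset = 'train')
print(newsgroups_train.target_names[:5])
# vectorize the raw posts into a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words = 'english')
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)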
Chapter 8 - Exercise#
Read through the spacy101 guide and begin to apply its principles to your own corpus: https://spacy.io/usage/spacy-101
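As a starting point, here is a minimal sketch (assuming the en_core_web_sm model has been downloaded) that applies a few spaCy 101 ideas, namely tokens, lemmas, part-of-speech tags, and named entities, to a short example sentence:
import spacy
# !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("Bram Stoker published Dracula in 1897.")
# print each token with its lemma and part-of-speech tag
for token in doc:
    print(token.text, token.lemma_, token.pos_)
# print the named entities recognized in the sentence
for ent in doc.ents:
    print(ent.text, ent.label_)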
Chapter 9 - Exercise#
Repeat the steps in this notebook with your own data. However, real data does not come with a fetch function. What importation steps do you need to consider so your own corpus works?
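One possible starting point, mirroring the novels example above, is to read a folder of plain-text files into a pandas DataFrame before any preprocessing. The data/my_corpus/ folder name below is a hypothetical placeholder for wherever your own .txt documents live:
import os
import pandas as pd
corpus_dir = 'data/my_corpus/'  # hypothetical folder containing your own .txt files
documents = {}
# loop through the folder of documents to open and read each one
for file_name in os.listdir(corpus_dir):
    if file_name.endswith('.txt'):
        with open(os.path.join(corpus_dir, file_name), 'r', encoding = 'utf-8') as f:
            documents[file_name] = f.read()
# populate a data frame with two columns: file name and document text
my_corpus = (pd.DataFrame.from_dict(documents, orient = 'index', columns = ['document_text'])
               .reset_index()
               .rename(columns = {'index': 'file_name'}))
my_corpus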