Solutions#
Example solutions for challenge exercises from each chapter in this book.
Chapter 1 - Exercises#
You will find challenge exercises to work on at the end of each chapter. They will require you to write code such as that found in the cell at the top of this notebook.
Click the “Colab” badge at the top of this notebook to open it in the Colaboratory environment. Press Shift and Enter simultaneously on your keyboard to run the code and draw your lucky card!
Remember: Press Shift and Enter on your keyboard to run a cell.
# import the necessary libraries to make the code work
import random
import calendar
from datetime import date, datetime
# define the deck and suits as character strings and split them on the spaces
deck = 'ace two three four five six seven eight nine ten jack queen king'.split()
suit = 'spades clubs hearts diamonds'.split()
print(deck)
print(suit)
['ace', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'jack', 'queen', 'king']
['spades', 'clubs', 'hearts', 'diamonds']
# define today's day name and date
today = calendar.day_name[date.today().weekday()]
# note: this reassigns the name `date`, shadowing the datetime.date class imported above
date = datetime.today().strftime('%Y-%m-%d')
print(today)
print(date)
Saturday
2023-05-20
# randomly sample the card value and suit
select_value = random.sample(deck, 1)[0]
select_suit = random.sample(suit, 1)[0]
print(select_value)
print(select_suit)
eight
spades
# combine the character strings and variables into the final statement
print("\nWelcome to TAML at SSDS!")
print("\nYour lucky card for " + today + " " + date + " is: " + select_value + " of " + select_suit)
Welcome to TAML at SSDS!
Your lucky card for Saturday 2023-05-20 is: eight of spades
Chapter 2 - Exercises#
(Required) Set up your Google Colaboratory (Colab) environment following the instructions in #1 listed above.
(Optional) Check that you can correctly open these notebooks in Jupyter Lab.
(Optional) Install the Anaconda Python distribution on your machine.
See 2_Python_environments.ipynb for instructions.
Chapter 3 - Exercises#
Define one variable for each of the four data types introduced above: 1) string, 2) boolean, 3) float, and 4) integer.
Define two lists that contain four elements each.
Define a dictionary that contains the two lists from #2 above.
Import the file “dracula.txt”. Save it in a variable named drac
Import the file “penguins.csv”. Save it in a variable named pen
Figure out how to find help to export just the first 1000 characters of drac as a .txt file named “dracula_short.txt”
Figure out how to export the pen dataframe as a file named “penguins_saved.csv”
If you encounter error messages, which ones?
#1
string1 = "Hello!"
string2 = "This is a sentence."
print(string1)
print(string2)
Hello!
This is a sentence.
bool1 = True
bool2 = False
print(bool1)
print(bool2)
True
False
float1 = 3.14
float2 = 12.345
print(float1)
print(float2)
3.14
12.345
integer1 = 8
integer2 = 4356
print(integer1)
print(integer2)
8
4356
#2
list1 = [integer2, string2, float1, "My name is:"]
list2 = [3, True, "What?", string1]
print(list1)
print(list2)
[4356, 'This is a sentence.', 3.14, 'My name is:']
[3, True, 'What?', 'Hello!']
#3
dict_one = {"direction": "up",
"code": 1234,
"first_list": list1,
"second_list": list2}
dict_one
{'direction': 'up',
'code': 1234,
'first_list': [4356, 'This is a sentence.', 3.14, 'My name is:'],
'second_list': [3, True, 'What?', 'Hello!']}
#4
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/dracula.txt
drac = open("data/dracula.txt", encoding="utf-8").read()
# print(drac)
#5
import pandas as pd
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/penguins.csv
pen = pd.read_csv("data/penguins.csv")
pen
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
344 rows × 7 columns
#6
# first slice the string you want to save
drac_short = drac[:1000]
# second, open in write mode and write the file to the data directory!
with open('data/dracula_short.txt', 'w', encoding='utf-8') as f:
f.write(drac_short)
# You can also copy files from Colab to your Google Drive
# Mount your GDrive
# from google.colab import drive
# drive.mount('/content/drive')
# Copy a file from Colab to GDrive
# !cp data/dracula_short.txt /content/drive/MyDrive
#7
pen.to_csv("data/penguins_saved.csv")
# !cp data/penguins_saved.csv /content/drive/MyDrive
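Note: by default, to_csv also writes the DataFrame's row index as an extra unnamed column. If you want the saved file to mirror the original, you can pass index = False:

# optional variant: drop the row index when saving
pen.to_csv("data/penguins_saved.csv", index = False)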
Chapter 4 - Exercises#
Load the file “gapminder-FiveYearData.csv” and save it in a variable named gap
Print the column names
Compute the mean for one numeric column
Compute the mean for all numeric columns
Tabulate frequencies for the “continent” column
Compute mean lifeExp and gdpPercap by continent
Create a subset of gap that contains only countries with lifeExp greater than 75 and gdpPercap less than 5000.
#1
import pandas as pd
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
gap
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.853030 |
2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.100710 |
3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.020 | 836.197138 |
4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |
... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | 1987 | 9216418.0 | Africa | 62.351 | 706.157306 |
1700 | Zimbabwe | 1992 | 10704340.0 | Africa | 60.377 | 693.420786 |
1701 | Zimbabwe | 1997 | 11404948.0 | Africa | 46.809 | 792.449960 |
1702 | Zimbabwe | 2002 | 11926563.0 | Africa | 39.989 | 672.038623 |
1703 | Zimbabwe | 2007 | 12311143.0 | Africa | 43.487 | 469.709298 |
1704 rows × 6 columns
#2
gap.columns
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
#3
gap["lifeExp"].mean()
59.47443936619714
# or
gap.describe()
year | pop | lifeExp | gdpPercap | |
---|---|---|---|---|
count | 1704.00000 | 1.704000e+03 | 1704.000000 | 1704.000000 |
mean | 1979.50000 | 2.960121e+07 | 59.474439 | 7215.327081 |
std | 17.26533 | 1.061579e+08 | 12.917107 | 9857.454543 |
min | 1952.00000 | 6.001100e+04 | 23.599000 | 241.165876 |
25% | 1965.75000 | 2.793664e+06 | 48.198000 | 1202.060309 |
50% | 1979.50000 | 7.023596e+06 | 60.712500 | 3531.846988 |
75% | 1993.25000 | 1.958522e+07 | 70.845500 | 9325.462346 |
max | 2007.00000 | 1.318683e+09 | 82.603000 | 113523.132900 |
#4
print(gap.mean())
year 1.979500e+03
pop 2.960121e+07
lifeExp 5.947444e+01
gdpPercap 7.215327e+03
dtype: float64
/var/folders/9g/fhnd1v790cj5ccxlv4rcvsy40000gq/T/ipykernel_22032/1146573181.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
print(gap.mean())
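The FutureWarning above appears because .mean() is applied to a DataFrame that also contains non-numeric columns (country and continent). A minimal way to avoid it, assuming a reasonably recent pandas, is to restrict the reduction to the numeric columns:

# restrict the mean to numeric columns to avoid the nuisance-column warning
print(gap.mean(numeric_only = True))

# or select the numeric columns explicitly before reducing
print(gap[["year", "pop", "lifeExp", "gdpPercap"]].mean())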
#5
gap["continent"].value_counts()
Africa 624
Asia 396
Europe 360
Americas 300
Oceania 24
Name: continent, dtype: int64
#6
le_gdp_by_continent = gap.groupby("continent").agg(mean_le = ("lifeExp", "mean"),
mean_gdp = ("gdpPercap", "mean"))
le_gdp_by_continent
mean_le | mean_gdp | |
---|---|---|
continent | ||
Africa | 48.865330 | 2193.754578 |
Americas | 64.658737 | 7136.110356 |
Asia | 60.064903 | 7902.150428 |
Europe | 71.903686 | 14469.475533 |
Oceania | 74.326208 | 18621.609223 |
#7
gap_75_5000 = gap[(gap["lifeExp"] > 75) & (gap["gdpPercap"] < 5000)]
gap_75_5000
country | year | pop | continent | lifeExp | gdpPercap | |
---|---|---|---|---|---|---|
22 | Albania | 2002 | 3508512.0 | Europe | 75.651 | 4604.211737 |
Chapter 5 - Penguins Exercises#
Learn more about the biological and spatial characteristics of penguins!
Use seaborn to make a scatterplot of two continuous variables. Color each point by species.
Make the same scatterplot as #1 above. This time, color each point by sex.
Make the same scatterplot as #1 above again. This time color each point by island.
Use the sns.FacetGrid method to make faceted plots to examine “flipper_length_mm” on the x-axis and “body_mass_g” on the y-axis.
import pandas as pd
import seaborn as sns
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/penguins.csv
peng = pd.read_csv("data/penguins.csv")
# set seaborn background theme, figure size, and axis and tick label size
# note: set_theme() resets earlier settings, so pass everything in a single call
sns.set_theme(style = 'ticks', font_scale = 2, rc = {'figure.figsize': (10, 7)})
peng
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
344 rows × 7 columns
#1
sns.scatterplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
hue = "species",
s = 250, alpha = 0.75);
#2
sns.scatterplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
hue = "sex",
s = 250, alpha = 0.75,
palette = ["red", "green"]).legend(title = "Sex",
fontsize = 20,
title_fontsize = 30,
loc = "best");
#3
sns.scatterplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
hue = "island").legend(loc = "lower right");
#4
facet_plot = sns.FacetGrid(data = peng, col = "island", row = "sex")
facet_plot.map(sns.scatterplot, "flipper_length_mm", "body_mass_g");
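An alternative sketch: recent seaborn versions can produce the same faceted scatterplot in a single call with the figure-level relplot function (col and row control the faceting):

# same faceted scatterplot using relplot
sns.relplot(data = peng, x = "flipper_length_mm", y = "body_mass_g",
            col = "island", row = "sex", kind = "scatter");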
Chapter 5 - Gapminder Exercises#
Figure out how to make a line plot that shows gdpPercap through time.
Figure out how to make a second line plot that shows lifeExp through time.
How can you plot gdpPercap with a different colored line for each continent?
Plot lifeExp with a different colored line for each continent.
import pandas as pd
import seaborn as sns
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/gapminder-FiveYearData.csv
gap = pd.read_csv("data/gapminder-FiveYearData.csv")
#1
# note: newer seaborn versions (>= 0.12) replace ci= with errorbar=, e.g. errorbar=('ci', 95)
sns.lineplot(data = gap, x = "year", y = "gdpPercap", ci = 95);
#2
sns.lineplot(data = gap, x = "year", y = "lifeExp", ci = False);
#3
sns.lineplot(data = gap, x = "year", y = "gdpPercap", hue = "continent", ci = False);
#4
sns.lineplot(data = gap, x = "year", y = "lifeExp",
hue = "continent", ci = False);
#4 with custom colors
sns.lineplot(data = gap, x = "year", y = "lifeExp",
hue = "continent",
ci = False,
palette = ["#00FFFF", "#458B74", "#E3CF57", "#8A2BE2", "#CD3333"]);
# color hex codes: https://www.webucator.com/article/python-color-constants-module/
# seaborn color palettes: https://www.reddit.com/r/visualization/comments/qc0b36/all_seaborn_color_palettes_together_so_you_dont/
Exercise - scikit-learn’s LinearRegression() function#
Compare our “by hand” OLS results to those produced by sklearn’s LinearRegression function. Are they the same?
Slope = 4
Intercept = -4
RMSE = 2.82843
y_hat = B0 + B1 * data.x
#1
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Recreate dataset
import pandas as pd
data = pd.DataFrame({"x": [1,2,3,4,5],
"y": [2,4,6,8,20]})
data
x | y | |
---|---|---|
0 | 1 | 2 |
1 | 2 | 4 |
2 | 3 | 6 |
3 | 4 | 8 |
4 | 5 | 20 |
# Our "by hand" OLS regression information:
B1 = 4
B0 = -4
RMSE = 2.82843
y_hat = B0 + B1 * data.x
# use scikit-learn to compute R-squared value
lin_mod = LinearRegression().fit(data[['x']], data[['y']])
print("R-squared: " + str(lin_mod.score(data[['x']], data[['y']])))
R-squared: 0.7999999999999999
# use scikit-learn to compute slope and intercept
print("scikit-learn slope: " + str(lin_mod.coef_))
print("scikit-learn intercept: " + str(lin_mod.intercept_))
scikit-learn slope: [[4.]]
scikit-learn intercept: [-4.]
# compare to our "by hand" versions. Both are the same!
print(int(lin_mod.coef_) == B1)
print(int(lin_mod.intercept_) == B0)
True
True
# use scikit-learn to compute RMSE
RMSE_scikit = round(mean_squared_error(data.y, y_hat, squared = False), 5)
print(RMSE_scikit)
2.82843
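Note: in newer scikit-learn versions the squared = False argument to mean_squared_error is deprecated in favor of a dedicated function. A hedged equivalent, assuming scikit-learn >= 1.4:

# equivalent RMSE computation in scikit-learn >= 1.4
# from sklearn.metrics import root_mean_squared_error
# RMSE_scikit = round(root_mean_squared_error(data.y, y_hat), 5)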
# Does our hand-computed RMSE equal that of scikit-learn at 5 digits? Yes!
print(round(RMSE, 5) == round(RMSE_scikit, 5))
True
Chapter 7 - Exercises - redwoods webscraping#
This also works with data scraped from the web. Below is a very brief BeautifulSoup example that saves the contents of the Sequoioideae (redwood trees) Wikipedia page in a variable named text.
Read through the code below
Practice by repeating for a webpage of your choice
#1
# See 7_English_preprocessing_basics.ipynb
#2
from bs4 import BeautifulSoup
import requests
import regex as re
import nltk
url = "https://en.wikipedia.org/wiki/Observable_universe"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html')
text = ""
for paragraph in soup.find_all('p'):
text += paragraph.text
text = re.sub(r'\[[0-9]*\]',' ',text)
text = re.sub(r'\s+',' ',text)
text = re.sub(r'\d',' ',text)
text = re.sub(r'[^\w\s]','',text)
text = text.lower()
text = re.sub(r'\s+',' ',text)
# print(text)
Chapter 7 - Exercise - Dracula versus Frankenstein#
Practice your text pre-processing skills on the classic novel Dracula! Here you’ll just be performing the standardization operations on a text string instead of a DataFrame, so be sure to adapt the practices you saw with the UN HRC corpus processing appropriately.
Can you:
Remove non-alphanumeric characters & punctuation?
Remove digits?
Remove unicode characters?
Remove extraneous spaces?
Standardize casing?
Lemmatize tokens?
Investigate classic horror novel vocabulary. Create a single TF-IDF sparse matrix that contains the vocabulary for Frankenstein and Dracula. You should only have two rows (one for each of these novels), but potentially thousands of columns to represent the vocabulary across the two texts. What are the 20 most unique words in each? Make a dataframe or visualization to illustrate the differences.
Read through this 20 newsgroups dataset example to get familiar with newspaper data. Do your best to understand and explain what is happening at each step of the workflow. “The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.”
# 1
import regex as re
from string import punctuation
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import pandas as pd
from collections import Counter
import seaborn as sns
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/evanmuzzall/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Import dracula.txt#
# !wget -P data/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/dracula.txt
text = open("data/dracula.txt", encoding="utf-8").read()
# print just the first 100 characters
print(text[:100])
The Project Gutenberg eBook of Dracula, by Bram Stoker
This eBook is for the use of anyone anywhere
Standardize Text#
Casing and spacing#
Oftentimes in text analysis, identifying occurrences of key word(s) is a necessary step. To do so, we may want “apple,” “ApPLe,” and “apple ” to be treated the same; i.e., as an occurrence of the token, ‘apple.’ To achieve this, we can standardize text casing and spacing:
# Converting all characters in a string to lowercase only requires one method:
message = "Hello! Welcome to TAML!"
print(message.lower())
# To replace instances of multiple spaces with one, we can use the regex module's 'sub' function:
# Documentation on regex can be found at: https://docs.python.org/3/library/re.html
single_spaces_msg = re.sub(r'\s+', ' ', message)
print(single_spaces_msg)
hello! welcome to taml!
Hello! Welcome to TAML!
Remove punctuation#
Remember that Python methods can be chained together.
Below, a standard for loop iterates over the punctuation string constant (imported from the string module) and replaces each of these characters with nothing.
print(punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
for char in punctuation:
text = text.lower().replace(char, "")
print(text[:100])
the project gutenberg ebook of dracula by bram stoker
this ebook is for the use of anyone anywhere
Tokenize the text#
Split each word on spaces.
# .split() returns a list of the tokens in a string, separated by the specified delimiter (default: " ")
tokens = text.split()
# View the first 20
print(tokens[:20])
['the', 'project', 'gutenberg', 'ebook', 'of', 'dracula', 'by', 'bram', 'stoker', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the']
Remove stop words#
Below is a list comprehension (a compact, single-line alternative to a for loop) that can accomplish this task for us; an equivalent explicit for loop is sketched after the output below.
filtered_text = [word for word in tokens if word not in stopwords.words('english')]
# show only the first 100 words
# do you see any stopwords?
print(filtered_text[:100])
['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook', 'use', 'anyone', 'anywhere', 'united', 'states', 'parts', 'world', 'cost', 'almost', 'restrictions', 'whatsoever', 'may', 'copy', 'give', 'away', 'reuse', 'terms', 'project', 'gutenberg', 'license', 'included', 'ebook', 'online', 'wwwgutenbergorg', 'located', 'united', 'states', 'check', 'laws', 'country', 'located', 'using', 'ebook', 'title', 'dracula', 'author', 'bram', 'stoker', 'release', 'date', 'october', '1995', 'ebook', '345', 'recently', 'updated', 'september', '5', '2022', 'language', 'english', 'produced', 'chuck', 'greif', 'online', 'distributed', 'proofreading', 'team', 'start', 'project', 'gutenberg', 'ebook', 'dracula', 'dracula', 'bram', 'stoker', 'illustration', 'colophon', 'new', 'york', 'grosset', 'dunlap', 'publishers', 'copyright', '1897', 'united', 'states', 'america', 'according', 'act', 'congress', 'bram', 'stoker', 'rights', 'reserved', 'printed', 'united', 'states', 'country', 'life', 'press', 'garden', 'city']
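For comparison, here is a sketch of the same filtering written as an explicit for loop. Converting the stopword list to a set first is a common speed-up, because the membership test then runs quickly for every token:

# equivalent explicit loop (same result as the list comprehension above)
english_stopwords = set(stopwords.words('english'))
filtered_text_loop = []
for word in tokens:
    if word not in english_stopwords:
        filtered_text_loop.append(word)
print(filtered_text_loop[:20])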
Lemmatizing/Stemming tokens#
Lemmatization and stemming are related, but different, practices. Both aim to reduce the inflectional forms of a token to a common base/root. However, how they go about doing so is the key differentiating factor.
Stemming operates by removing the prefixes and/or suffixes of a word. Examples include:
flooding to flood
studies to studi
risky to risk
Lemmatization attempts to contextualize a word, arriving at its base meaning. Lemmatization reductions can occur across various dimensions of speech. Examples include:
Plural to singular (corpora to corpus)
Condition (better to good)
Gerund (running to run)
One technique is not strictly better than the other - it’s a matter of project needs and proper application.
# note: the WordNet lemmatizer requires the 'wordnet' resource: nltk.download('wordnet')
stmer = nltk.PorterStemmer()
lmtzr = nltk.WordNetLemmatizer()
# do you see any differences?
token_stem = [ stmer.stem(token) for token in filtered_text]
token_lemma = [ lmtzr.lemmatize(token) for token in filtered_text ]
print(token_stem[:20])
print(token_lemma[:20])
['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook', 'use', 'anyon', 'anywher', 'unit', 'state', 'part', 'world', 'cost', 'almost', 'restrict', 'whatsoev', 'may', 'copi']
['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook', 'use', 'anyone', 'anywhere', 'united', 'state', 'part', 'world', 'cost', 'almost', 'restriction', 'whatsoever', 'may', 'copy']
Convert to dataframe#
# part-of-speech tag the lemmatized tokens first (requires nltk.download('averaged_perceptron_tagger'))
chunked = nltk.pos_tag(token_lemma)
df = pd.DataFrame(chunked, columns=['word', 'pos'])
df.head(n = 10)
word | pos | |
---|---|---|
0 | project | NN |
1 | gutenberg | NN |
2 | ebook | NN |
3 | dracula | NN |
4 | bram | NN |
5 | stoker | NN |
6 | ebook | NN |
7 | use | NN |
8 | anyone | NN |
9 | anywhere | RB |
df.shape
(73541, 2)
Visualize the 20 most frequent words#
top = df.copy()
count_words = Counter(top['word'])
count_words.most_common()[:20]
[('said', 569),
('one', 509),
('could', 493),
('u', 463),
('must', 451),
('would', 428),
('shall', 427),
('time', 425),
('know', 420),
('may', 416),
('see', 398),
('come', 377),
('van', 322),
('hand', 310),
('came', 307),
('helsing', 300),
('went', 298),
('lucy', 296),
('go', 296),
('like', 278)]
words_df = pd.DataFrame(count_words.items(), columns=['word', 'count']).sort_values(by = 'count', ascending=False)
words_df[:20]
word | count | |
---|---|---|
205 | said | 569 |
252 | one | 509 |
151 | could | 493 |
176 | u | 463 |
315 | must | 451 |
158 | would | 428 |
274 | shall | 427 |
161 | time | 425 |
220 | know | 420 |
17 | may | 416 |
378 | see | 398 |
680 | come | 377 |
120 | van | 322 |
1165 | hand | 310 |
184 | came | 307 |
121 | helsing | 300 |
542 | went | 298 |
155 | go | 296 |
100 | lucy | 296 |
403 | like | 278 |
# What would you need to do to improve an approach to word visualization such as this one?
top_plot = sns.barplot(x = 'word', y = 'count', data = words_df[:20])
top_plot.set_xticklabels(top_plot.get_xticklabels(),rotation = 40);
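One possible improvement, as a sketch: horizontal bars keep the word labels readable without rotating them, and uninformative tokens (such as 'u', which likely comes from lemmatizing 'us') could be filtered out beforehand:

# horizontal bars avoid the need to rotate long tick labels
sns.barplot(x = 'count', y = 'word', data = words_df[:20], orient = 'h');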
#2
import spacy
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a new directory to house the two novels (-p avoids an error if it already exists)
!mkdir -p data/novels/
# Download the two novels
# !wget -P data/novels/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/dracula.txt
# !wget -P data/novels/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/spring2023/data/frankenstein.txt
# See that they are there!
!ls data/novels
dracula.txt frankenstein.txt
import os
corpus = os.listdir('data/novels/')
# View the contents of this directory
corpus
['frankenstein.txt', 'dracula.txt']
empty_dictionary = {}
# Loop through the folder of documents to open and read each one
for document in corpus:
with open('data/novels/' + document, 'r', encoding = 'utf-8') as to_open:
empty_dictionary[document] = to_open.read()
# Populate the data frame with two columns: file name and document text
novels = (pd.DataFrame.from_dict(empty_dictionary,
orient = 'index')
.reset_index().rename(index = str,
columns = {'index': 'file_name', 0: 'document_text'}))
novels
file_name | document_text | |
---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... |
novels['clean_text'] = novels['document_text'].str.replace(r'[^\w\s]', ' ', regex = True)
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bra... |
novels['clean_text'] = novels['clean_text'].str.replace(r'\d', ' ', regex = True)
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bra... |
novels['clean_text'] = novels['clean_text'].str.encode('ascii', 'ignore').str.decode('ascii')
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein b... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bra... |
novels['clean_text'] = novels['clean_text'].str.replace(r'\s+', ' ', regex = True)
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | The Project Gutenberg eBook of Frankenstein by... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | The Project Gutenberg eBook of Dracula by Bram... |
novels['clean_text'] = novels['clean_text'].str.lower()
novels
file_name | document_text | clean_text | |
---|---|---|---|
0 | frankenstein.txt | The Project Gutenberg eBook of Frankenstein, b... | the project gutenberg ebook of frankenstein by... |
1 | dracula.txt | The Project Gutenberg eBook of Dracula, by Bra... | the project gutenberg ebook of dracula by bram... |
# !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
novels['clean_text'] = novels['clean_text'].apply(lambda row: ' '.join([w.lemma_ for w in nlp(row)]))
novels
tf_vectorizer = TfidfVectorizer(ngram_range = (1, 3),
stop_words = 'english',
max_df = 0.50
)
tf_sparse = tf_vectorizer.fit_transform(novels['clean_text'])
tf_sparse.shape
(2, 168048)
tfidf_df = pd.DataFrame(tf_sparse.todense(), columns = tf_vectorizer.get_feature_names_out())
tfidf_df
aback | aback moment | aback moment know | abaft | abaft bi | abaft bi bank | abaft krok | abaft krok hooal | abandon abortion | abandon abortion spurn | ... | zophagous life eat | zophagous patient | zophagous patient effect | zophagous patient outburst | zophagous patient report | zophagous wild | zophagous wild raving | zophagy | zophagy puzzle | zophagy puzzle little | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.002998 | 0.002998 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.001011 | 0.001011 | 0.001011 | 0.002021 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.000000 | 0.000000 | ... | 0.001011 | 0.003032 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 |
2 rows × 168048 columns
tfidf_df.max().sort_values(ascending = False).head(n = 20)
van 0.326435
helsing 0.310265
van helsing 0.310265
lucy 0.304201
elizabeth 0.275854
mina 0.246595
jonathan 0.210212
count 0.202127
dr 0.191010
harker 0.178882
clerval 0.176906
justine 0.164913
felix 0.149920
seward 0.140478
diary 0.120265
dr seward 0.118244
perceive 0.116938
box 0.116223
geneva 0.107943
misfortune 0.098948
dtype: float64
titles = novels['file_name'].str.slice(stop = -4)
titles = list(titles)
titles
['frankenstein', 'dracula']
tfidf_df['TITLE'] = titles
tfidf_df
aback | aback moment | aback moment know | abaft | abaft bi | abaft bi bank | abaft krok | abaft krok hooal | abandon abortion | abandon abortion spurn | ... | zophagous patient | zophagous patient effect | zophagous patient outburst | zophagous patient report | zophagous wild | zophagous wild raving | zophagy | zophagy puzzle | zophagy puzzle little | TITLE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.002998 | 0.002998 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | frankenstein |
1 | 0.001011 | 0.001011 | 0.001011 | 0.002021 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.000000 | 0.000000 | ... | 0.003032 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | 0.001011 | dracula |
2 rows × 168049 columns
# frankenstein top 20 words
title = tfidf_df[tfidf_df['TITLE'] == 'frankenstein']
title.max(numeric_only = True).sort_values(ascending = False).head(20)
elizabeth 0.275854
clerval 0.176906
justine 0.164913
felix 0.149920
perceive 0.116938
geneva 0.107943
misfortune 0.098948
frankenstein 0.092951
beheld 0.083955
victor 0.083955
murderer 0.080957
henry 0.077959
cousin 0.077959
safie 0.074960
william 0.074960
cottager 0.071962
chapter chapter 0.068963
chapter chapter chapter 0.065965
creator 0.062967
exclaim 0.059968
dtype: float64
# dracula top 20 words
title = tfidf_df[tfidf_df['TITLE'] == 'dracula']
title.max(numeric_only = True).sort_values(ascending = False).head(20)
van 0.326435
helsing 0.310265
van helsing 0.310265
lucy 0.304201
mina 0.246595
jonathan 0.210212
count 0.202127
dr 0.191010
harker 0.178882
seward 0.140478
diary 0.120265
dr seward 0.118244
box 0.116223
sort 0.096010
madam 0.095000
madam mina 0.087925
don 0.086915
quincey 0.085904
godalming 0.079840
morris 0.078829
dtype: float64
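The exercise also asks for a dataframe or visualization that illustrates the differences. One possible sketch, plotting each novel's top-20 TF-IDF terms side by side (this assumes the tfidf_df built above):

# plot each novel's top-20 TF-IDF terms as horizontal bars
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize = (14, 6))
for ax, novel in zip(axes, ['frankenstein', 'dracula']):
    top20 = (tfidf_df[tfidf_df['TITLE'] == novel]
             .max(numeric_only = True)
             .sort_values(ascending = False)
             .head(20))
    top20.plot.barh(ax = ax, title = novel)
    ax.invert_yaxis()
plt.tight_layout();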
Chapter 8 - Exercises#
Filter the tokens from the H.G. Wells text variable to 1) lowercase all text, 2) remove punctuation, 3) remove spaces and line breaks, 4) remove numbers, and 5) remove stopwords - all in one line!
# From H.G. Wells's A Short History of the World, Project Gutenberg
text = """Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity. {111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized. He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days. His collection has been unearthed and is
perhaps the most precious store of historical material in the
world. The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes. He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions. But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there. This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians. They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire. Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor. {112} He came up against Babylon, there was a battle
outside the walls, and the gates of the city were opened to him
(538 B.C.). His soldiers entered the city without fighting. The
crown prince Belshazzar, the son of Nabonidus, was feasting, the
Bible relates, when a hand appeared and wrote in letters of fire
upon the wall these mystical words: _"Mene, Mene, Tekel,
Upharsin,"_ which was interpreted by the prophet Daniel, whom he
summoned to read the riddle, as "God has numbered thy kingdom and
finished it; thou art weighed in the balance and found wanting and
thy kingdom is given to the Medes and Persians." Possibly the
priests of Bel Marduk knew something about that writing on the
wall. Belshazzar was killed that night, says the Bible.
Nabonidus was taken prisoner, and the occupation of the city was
so peaceful that the services of Bel Marduk continued without
intermission."""
# Once we've installed the model, we can import it like any other Python library
import en_core_web_md
# This instantiates a spaCy text processor based on the installed model
nlp = en_core_web_md.load()
# Apply the pipeline
doc = nlp(text)
# lowercase all text
clean = [token.lower_ for token in doc if \
# remove punctuation
token.is_punct == False and \
# remove spaces and line breaks
token.is_space == False and \
# remove numbers
token.is_alpha == True and \
# remove (english) stopwords
token.is_stop == False]
print(clean)
['assyrian', 'monarchs', 'especially', 'sardanapalus', 'babylon', 'scene', 'great', 'intellectual', 'activity', 'sardanapalus', 'assyrian', 'babylon', 'ized', 'library', 'library', 'paper', 'clay', 'tablets', 'writing', 'mesopotamia', 'early', 'sumerian', 'days', 'collection', 'unearthed', 'precious', 'store', 'historical', 'material', 'world', 'chaldean', 'line', 'babylonian', 'monarchs', 'nabonidus', 'keener', 'literary', 'tastes', 'patronized', 'antiquarian', 'researches', 'date', 'worked', 'investigators', 'accession', 'sargon', 'commemorated', 'fact', 'inscriptions', 'signs', 'disunion', 'empire', 'sought', 'centralize', 'bringing', 'number', 'local', 'gods', 'babylon', 'setting', 'temples', 'device', 'practised', 'successfully', 'romans', 'later', 'times', 'babylon', 'roused', 'jealousy', 'powerful', 'priesthood', 'bel', 'marduk', 'dominant', 'god', 'babylonians', 'cast', 'possible', 'alternative', 'nabonidus', 'found', 'cyrus', 'persian', 'ruler', 'adjacent', 'median', 'empire', 'cyrus', 'distinguished', 'conquering', 'croesus', 'rich', 'king', 'lydia', 'eastern', 'asia', 'minor', 'came', 'babylon', 'battle', 'outside', 'walls', 'gates', 'city', 'opened', 'soldiers', 'entered', 'city', 'fighting', 'crown', 'prince', 'belshazzar', 'son', 'nabonidus', 'feasting', 'bible', 'relates', 'hand', 'appeared', 'wrote', 'letters', 'fire', 'wall', 'mystical', 'words', 'mene', 'mene', 'tekel', 'upharsin', 'interpreted', 'prophet', 'daniel', 'summoned', 'read', 'riddle', 'god', 'numbered', 'thy', 'kingdom', 'finished', 'thou', 'art', 'weighed', 'balance', 'found', 'wanting', 'thy', 'kingdom', 'given', 'medes', 'persians', 'possibly', 'priests', 'bel', 'marduk', 'knew', 'writing', 'wall', 'belshazzar', 'killed', 'night', 'says', 'bible', 'nabonidus', 'taken', 'prisoner', 'occupation', 'city', 'peaceful', 'services', 'bel', 'marduk', 'continued', 'intermission']
Read through the spacy101 guide and begin to apply its principles to your own corpus: https://spacy.io/usage/spacy-101
Chapter 9 - Exercise#
Repeat the steps in this notebook with your own data. However, real data does not come with a fetch function. What importation steps do you need to consider so your own corpus works?
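One possible importation pattern, sketched under the assumption that your corpus is a folder of plain-text files (the data/my_corpus/ path is hypothetical):

import os
import pandas as pd

# read every .txt file in a (hypothetical) corpus folder into a dataframe
corpus_dir = 'data/my_corpus/'
docs = {}
for file_name in os.listdir(corpus_dir):
    if file_name.endswith('.txt'):
        with open(os.path.join(corpus_dir, file_name), 'r', encoding = 'utf-8') as f:
            docs[file_name] = f.read()

my_corpus = (pd.DataFrame.from_dict(docs, orient = 'index', columns = ['document_text'])
               .reset_index()
               .rename(columns = {'index': 'file_name'}))

Other things to consider: file encodings, whether each document lives in its own file or must be split out of a larger one, and how much cleaning (casing, punctuation, unicode, whitespace) is needed before vectorization.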