Chapter 2 - Core machine learning concepts; building text vocabularies¶

2022 January 26

towardsdatascience.com - figure above and quotes below

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import math
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
import scipy.sparse

What is machine learning?¶

Machine learning is a toolbox for modeling, investigating, and understanding data. To understand machine learning, a handful of key terms need to be defined.

scikit-learn provides the foundation for this introductory machine learning session.

Be sure to bookmark the below resources. Try them out with data from the data/ folder as well as your own data!

scikit-learn tutorials: https://scikit-learn.org/stable/tutorial/index.html
scikit-learn user guide: https://scikit-learn.org/stable/user_guide.html

Key terms¶

Artificial intelligence: “A program that can sense, reason, act, and adapt.”
Machine learning: “Algorithims whose performance improve as they are exposed to more data over time.”
Deep learning: “Subset of machine learning in which multilayered neural networks learn from vast amounts of data.”

x and y variables¶

x variables are the independent variables. You will also see these referred to in a variety of contexts as features, covariates, predictors, input, or explanatory variables.
The y variable is the dependent variable. This is also referred to as the response, outcome, and target variable.

Types of machine learning:¶

Supervised machine learning: algorithms train models that (best) learn the relationships between the x and y variables on a labeled dataset (called the training set, where the ground truth y variable is known). The performance of the trained model is then evaluated after it predicts the outcome on new data that the model has never seen before (called the test set) where where ground truth y label is unknown.

The equation looks like: \(y=f(x)+ϵ\)

\(f\) is the unknown function that relates x to y and is estimated
\(ϵ\) is random error independent of x and averages to zero

Data splitting: Training and test performance¶

Preparation for supervised methods requires a split of the dataset into training and test sets.

Training set: is comprised of the data that will be used by the model to learn the relationships between the x and y variables. This “usually” consists of a majority portion of the rows in the dataset, such as 70%.
Test set: consists of the holdout data that the trained model has not yet seen and where the trained model will try predict the unknown labels. This “usually” consists of the remaining 30% of rows.

NOTE: These values are somewhat arbitrary. It could be argued that the more training data that are available, the “better” your model might be. Conversely, more test data could lead to higher performance on the test data.

Furthermore, if you have millions of rows of data, a splits of 50:50 or 99:1 might also work fine depending on the context.

Performance is generally assessed through cost or loss functions such as misclassification error, sensitivity/specificity, precision/recall, (root) mean squared error, R-squared, AIC, BIC, and other confusion matrix derivations.
Cross-validation is the best way to split your data, because it repeats the train/test data splitting process \(k\) number of times and rotates the test set each time so that every observation is included in a test set at least once. Performance results are then averaged to compute predictive performance.

cv_wiki

Model evaluation¶

Underfit: A model is underfit if it performs poorly on the training data because it could not discern the relationship(s) between the x and **_y_variables.
Overfit: A model is overfit if it performs well on the training dataset but poorly on the test set.
Bias/variance tradeoff:
Bias refers to the systematic error due to incorrect assumptions about the machine learning process or the framing of the problem itself. High bias models generally underfit and/or contain large amounts of error.
Variance refers to changes in the model predictive accuracy due to fluctuations in the training set or when using different training datasets, and how much the model can adjust to each. High variance could be due to model complexity, noise in the dataset, and overfitting.
The tradeoff means that models with high bias will have low variance, and vice-versa. The goal is to find the optimal combination of these two factors for your problem.

Classification versus regression¶

Supervised classification: models probability that y is 1 given the covariates x and is performed when y is categorical (either binary yes/no or multinomial): \(Pr(y=1|x)\)
Medical diagnosis
Image classification
Fraud detection
Sentiment analysis

Supervised regression: models the conditional mean probability \(E(y|x)\) and is performed when y is continuous (i.e., integer or float).
Life expectancy
Population growth
Weather forecasting
Stock market prediction

Supervised algorithms include:
Regressions: ordinary least squares, generalized linear model, lasso
Tree-based methods: decision tree, random forest, boosted trees
Kernel-based methods: various; transform non-linear to linear space

View scikit-learn’s interactive documentation for estimator (algorithm) selection: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

skest

Unsupervised machine learning: algorithms are used to sort unlabeled datasets, without attempting to predict anything.

The equation could simply look like:

\(\sim\ x\)

Algorithms include:

Clustering: hierarchical agglomerative, k-means, centroid, density-based, DBSCAN, GLRM
Dimension reduction: PCA, t-SNE, UMAP

Reinforcement: algorithims are used to make real-time decisions and skill acquisition in fields such as game-theory and robotics.

Other: semi-supervised, etc.

Bias in research

bias_res

Ethical considerations:

Begin to familiarize yourself with the many ethnical points to consider along with their real-world implications.

OLS example - by “hand”¶

Ordinary least squares regression provides a great example of the machine learning process and some of the key terms defined above.

Generate toy data¶

import pandas as pd
data = pd.DataFrame({"x": [1,2,3,4,5],
                     "y": [2,4,6,8,20]})
data

	x	y
0	1	2
1	2	4
2	3	6
3	4	8
4	5	20

Calculate means of x and y¶

x_mean = round(sum(data.x) / len(data.x), 2)
y_mean = round(sum(data.y) / len(data.y), 2)
print("mean of x:", x_mean)
print("mean of y:", y_mean)

mean of x: 3.0
mean of y: 8.0

Visualize the data (training data)¶

Plot the known relationships between x and y.

plt.scatter(data.x, data.y);

Calculate error for each observation¶

Error (least squares method) is the difference between each observation and the mean. Below this will be represented visually by the distance between each point and the line of best fit.

x_error = data.x - x_mean
y_error = data.y - y_mean

# x
print(data.x)
print(x_mean)
print(x_error)

  1
  2
  3
  4
  5
Name: x, dtype: int64
0
 -2.0
 -1.0
  0.0
  1.0
  2.0
Name: x, dtype: float64

# y
print(data.y)
print(y_mean)
print(y_error)

   2
   4
   6
   8
  20
Name: y, dtype: int64
0
  -6.0
  -4.0
  -2.0
   0.0
  12.0
Name: y, dtype: float64

Compute beta coefficients¶

## Estimate the B1 coefficient (slope)
B1 = sum((data.x - x_mean) * (data.y - y_mean)) / sum((data.x - x_mean) **2)
print("slope (B1) is: ", B1)

## Estimate B0 coefficient (intercept)
B0 = y_mean - (B1 * x_mean)
print("intercept (B0) is: ", B0)

slope (B1) is:  4.0
intercept (B0) is:  -4.0

Plot best fit line¶

import seaborn as sns
import numpy as np
sns.regplot(data = data, 
            x = "x", y = "y",
           ci = None)
plt.xticks(np.arange(min(data.x), max(data.x)+1, 1.0))
plt.yticks(np.arange(min(data.y), max(data.y)+1, 2.0));

Generate the predictions (test set)¶

y_hat = B0 + B1 * data.x
print("Ground truth y values:")
print(data.y)
print()
print("Predicted y values:")
print(y_hat)

Ground truth y values:
   2
   4
   6
   8
  20
Name: y, dtype: int64

Predicted y values:
   0.0
   4.0
   8.0
  12.0
  16.0
Name: x, dtype: float64

Visualize the error for each observation¶

ls_err

Evaluate the performance using RMSE¶

Root mean squared error is a handy performance metric because it has the same units of whatever is plotted on the y-axis.

In our toy example here, we are off an average of 2.82843 units on the y-axis of a given data point from the line of best fit.

# 1. Square each error
y_error_sq = y_error ** 2
print(y_error_sq)

   36.0
   16.0
    4.0
    0.0
  144.0
Name: y, dtype: float64

# 2. Sum these values
sum_error_sq = sum(y_error_sq)
print(sum_error_sq)

200.0

# 3. Calculate RMSE - the square root of summed squared error divided by the length of y
import math
RMSE = math.sqrt(sum_error_sq) / len(data.y)
print(round(RMSE, 5))

2.82843

Compare our by “hand” results to the scikit-learn version!¶

# use scikit-learn to compute R-squared value
from sklearn.linear_model import LinearRegression

lin_mod = LinearRegression().fit(data[['x']], data[['y']])
print("R-squared: " + str(lin_mod.score(data[['x']], data[['y']])))

R-squared: 0.7999999999999999

# use scikit-learn to compute slope and intercept
print("scikit-learn slope: " + str(lin_mod.coef_))
print("scikit-learn intercept: " + str(lin_mod.intercept_))

scikit-learn slope: [[4.]]
scikit-learn intercept: [-4.]

# compare to our by "hand" versions. Both are the same!
print(int(lin_mod.coef_) == B1)
print(int(lin_mod.intercept_) == B0)

True
True

# use scikit-learn to compute RMSE
from sklearn.metrics import mean_squared_error

RMSE_scikit = round(mean_squared_error(data.y, y_hat, squared = False), 5)
print(RMSE_scikit)

2.82843

# Does our hand-computed RMSE equal that of scikit-learn at 5 digits?? Yes!
print(round(RMSE, 5) == round(RMSE_scikit, 5))

True

Supervised machine learning - logistic regression¶

Fortunately, you do not have to do any of this by hand thanks to scikit-learn!

Here is a short workflow to fit a logistic regression model. Logistic regression is similar to linear regression \(y \sim\ x\) but regresses for the probability of a categorical outcome.

Let’s calculate training and test set accuracy to predict whether a penguin is MALE or FEMALE based on thier biological and spatial characteristics.

# load data
penguins = pd.read_csv("data/penguins.csv")

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [19], in <module>
      1 # load data
----> 2 penguins = pd.read_csv("data/penguins.csv")

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py:610, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    605 kwds_defaults = _refine_defaults_read(
    606     dialect, delimiter, delim_whitespace, engine, sep, defaults={"delimiter": ","}
    607 )
    608 kwds.update(kwds_defaults)
--> 610 return _read(filepath_or_buffer, kwds)

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py:462, in _read(filepath_or_buffer, kwds)
    459 _validate_names(kwds.get("names", None))
    461 # Create the parser.
--> 462 parser = TextFileReader(filepath_or_buffer, **kwds)
    464 if chunksize or iterator:
    465     return parser

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py:819, in TextFileReader.__init__(self, f, engine, **kwds)
    816 if "has_index_names" in kwds:
    817     self.options["has_index_names"] = kwds["has_index_names"]
--> 819 self._engine = self._make_engine(self.engine)

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py:1050, in TextFileReader._make_engine(self, engine)
   1046     raise ValueError(
   1047         f"Unknown engine: {engine} (valid options are {mapping.keys()})"
   1048     )
   1049 # error: Too many arguments for "ParserBase"
-> 1050 return mapping[engine](self.f, **self.options)

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py:1867, in CParserWrapper.__init__(self, src, **kwds)
   1864 kwds["usecols"] = self.usecols
   1866 # open handles
-> 1867 self._open_handles(src, kwds)
   1868 assert self.handles is not None
   1869 for key in ("storage_options", "encoding", "memory_map", "compression"):

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py:1362, in ParserBase._open_handles(self, src, kwds)
   1358 def _open_handles(self, src: FilePathOrBuffer, kwds: Dict[str, Any]) -> None:
   1359     """
   1360     Let the readers open IOHanldes after they are done with their potential raises.
   1361     """
-> 1362     self.handles = get_handle(
   1363         src,
   1364         "r",
   1365         encoding=kwds.get("encoding", None),
   1366         compression=kwds.get("compression", None),
   1367         memory_map=kwds.get("memory_map", False),
   1368         storage_options=kwds.get("storage_options", None),
   1369     )

File ~/opt/anaconda3/lib/python3.8/site-packages/pandas/io/common.py:642, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    640         errors = "replace"
    641     # Encoding
--> 642     handle = open(
    643         handle,
    644         ioargs.mode,
    645         encoding=ioargs.encoding,
    646         errors=errors,
    647         newline="",
    648     )
    649 else:
    650     # Binary mode
    651     handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'data/penguins.csv'

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	MALE
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	FEMALE
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	FEMALE
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	FEMALE

Preprocess the data - remove rows with NaN (missing) values¶

# count number of rows with missing data in penguins. Eleven...
penguins.isnull().any(axis=1).sum()

# make a copy with listwise deleted rows
p_complete = penguins.dropna()

# count number of rows with missing data in p_complete. Zero! 
p_complete.isnull().any(axis=1).sum()

Preprocess the data - convert categorical variables to indicators¶

Check out this great tutorial on one-hot encoding to learn more.

# convert island categorical variable into numeric indicators
p_dummy_island = pd.get_dummies(p_complete.island, prefix = "island")
p_dummy_island.head()

	island_Biscoe	island_Dream	island_Torgersen
0	0	0	1
1	0	0	1
2	0	0	1
4	0	0	1
5	0	0	1

# convert species categorical variable into numeric indicators
p_dummy_species = pd.get_dummies(p_complete.species, prefix = "species")
p_dummy_species.head()

	species_Adelie	species_Chinstrap	species_Gentoo
0	1	0	0
1	1	0	0
2	1	0	0
4	1	0	0
5	1	0	0

Preprocess the data - remove the island and species variables from p_complete¶

# view column names
print(p_complete.columns)

# drop the first and second columns
p_complete = p_complete.drop(p_complete.columns[[0, 1]], axis=1)
p_complete.head()

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	39.1	18.7	181.0	3750.0	MALE
1	39.5	17.4	186.0	3800.0	FEMALE
2	40.3	18.0	195.0	3250.0	FEMALE
4	36.7	19.3	193.0	3450.0	FEMALE
5	39.3	20.6	190.0	3650.0	MALE

Recombine the numeric indicators with the other variables¶

clean_penguins = pd.concat([p_dummy_island, p_dummy_species, p_complete], axis=1) 
clean_penguins.head()

	island_Torgersen	species_Adelie	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	1	1	39.1	18.7	181.0	3750.0	MALE
1	1	1	39.5	17.4	186.0	3800.0	FEMALE
2	1	1	40.3	18.0	195.0	3250.0	FEMALE
4	1	1	36.7	19.3	193.0	3450.0	FEMALE
5	1	1	39.3	20.6	190.0	3650.0	MALE

Recode MALE as 1 and FEMALE as 0¶

clean_penguins['sex'] = clean_penguins['sex'].map({'MALE': 1, 'FEMALE': 0})

clean_penguins.head()

	island_Torgersen	species_Adelie	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	1	1	39.1	18.7	181.0	3750.0	1
1	1	1	39.5	17.4	186.0	3800.0	0
2	1	1	40.3	18.0	195.0	3250.0	0
4	1	1	36.7	19.3	193.0	3450.0	0
5	1	1	39.3	20.6	190.0	3650.0	1

Define your logistic regression object¶

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver = 'liblinear')

# Use something like the code below for multi-class classification
# lr = LogisticRegression(multi_class='multinomial', solver='lbfgs')

Split the `clean_penguins` dataset into training and test sets¶

# Define x and y for both training and test sets
from sklearn.model_selection import train_test_split

x = clean_penguins[['island_Biscoe', 'island_Dream', 'island_Torgersen', 'species_Adelie',
       'species_Chinstrap', 'species_Gentoo', 'bill_length_mm',
       'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = np.array(clean_penguins['sex']) 

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 123)

x_train.shape

(233, 10)

x_test.shape

(100, 10)

len(y_train)

len(y_test)

Fit the model to the training data (0.84 accuracy)¶

lr.fit(x_train, y_train)

LogisticRegression(solver='liblinear')

# View the accuracy
lr.score(x_train, y_train)

0.8412017167381974

Generate predicted values on the test data¶

# Generate predicted y values based on the x test set data
predictions = lr.predict(x_test)
predictions

array([1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1])

Calculate test set accuracy (0.81 accuracy)¶

test_score = lr.score(x_test, y_test)
print(test_score)

0.81

from sklearn import metrics

cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

[[34 10]
 [ 9 47]]

# Fancy it up! Use plt.savefig() to export
plt.figure(figsize=(8,8))
sns.heatmap(cm, annot = True, fmt = ".0f", 
            cbar = False, linewidths = 2, 
            square = True, cmap = 'YlGnBu', annot_kws={"size": 20})
plt.ylabel('Ground truth y label')
plt.xlabel('Predicted y label')
all_sample_title = 'Accuracy Score: {0}'.format(test_score)
plt.title(all_sample_title, size = 15);

Building text vocabularies¶

What about how text is predicted? If we have text as data, we also need to convert the representation of text into a format that machine learning algorithms can understand.

Define a corpus¶

corpus = [
    "First document text goes in this list of strings.",
    "Second document text goes here as well.",
    "Third document text text.",
    "Document four!"
    ]
corpus

['First document text goes in this list of strings.',
 'Second document text goes here as well.',
 'Third document text text.',
 'Document four!']

Bag of words model with `CountVectorizer`¶

A bag of words model removes punctuation and casefolds the text to lowercase before counting the words.

from sklearn.feature_extraction.text import CountVectorizer

# create the bag of words
vectorizer = CountVectorizer()

# the .fit method will learn the vocabulary 
vectorizer.fit(corpus)

# Print the vocabulary
vectorizer.vocabulary_

{'first': 2,
 'document': 1,
 'text': 11,
 'goes': 4,
 'in': 6,
 'this': 13,
 'list': 7,
 'of': 8,
 'strings': 10,
 'second': 9,
 'here': 5,
 'as': 0,
 'well': 14,
 'third': 12,
 'four': 3}

Document term matrix¶

A document term matrix displays term frequencies in sparse format across a collection of docuemnts. The .transform method will help us here!

The first number is the document number
The second number is the word from the vocabulary
The third number is the number of times that word occurs

vector = vectorizer.transform(corpus)
print(vector) 

  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (0, 6)	1
  (0, 7)	1
  (0, 8)	1
  (0, 10)	1
  (0, 11)	1
  (0, 13)	1
  (1, 0)	1
  (1, 1)	1
  (1, 4)	1
  (1, 5)	1
  (1, 9)	1
  (1, 11)	1
  (1, 14)	1
  (2, 1)	1
  (2, 11)	2
  (2, 12)	1
  (3, 1)	1
  (3, 3)	1

Present the sparse matrix¶

# each row is a document, each column is a word from the vocabulary! 
# 1 means that word is present in that particular document
print(vector.toarray())

[[0 1 1 0 1 0 1 1 1 0 1 1 0 1 0]
 [1 1 0 0 1 1 0 0 0 1 0 1 0 0 1]
 [0 1 0 0 0 0 0 0 0 0 0 2 1 0 0]
 [0 1 0 1 0 0 0 0 0 0 0 0 0 0 0]]

# get the column names (alphabetical sort)
vectorizer.get_feature_names()

['as',
 'document',
 'first',
 'four',
 'goes',
 'here',
 'in',
 'list',
 'of',
 'second',
 'strings',
 'text',
 'third',
 'this',
 'well']

# What does this tell us? 
vectorizer.transform(['document']).toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

N-grams¶

While we have focused on single words above, we can apply this approach to multi-word phrases in order to try to predict linguistic instances and potentially better understand context of a corpus of documents.

Use the below token patterns:

bigrams ngram_range = (1, 2)
trigrams ngram_range = (1, 3)
etc.

bigram_vectorizer = CountVectorizer(ngram_range = (1,2),
                                    token_pattern = r'\b\w+\b', 
                                    min_df = 1)
bigram_vectorizer

CountVectorizer(ngram_range=(1, 2), token_pattern='\\b\\w+\\b')

Since we specify bigrams, both unigrams and bigrams are returned!

from sklearn.feature_extraction import DictVectorizer
bigram_analyzer = bigram_vectorizer.build_analyzer()
bigram_analyzer('Welcome to Stanford Libraries!')

['welcome',
 'to',
 'stanford',
 'libraries',
 'welcome to',
 'to stanford',
 'stanford libraries']

Apply n-grams to our above corpus¶

corpus

['First document text goes in this list of strings.',
 'Second document text goes here as well.',
 'Third document text text.',
 'Document four!']

# perform the transformation
x = bigram_vectorizer.fit_transform(corpus).toarray()
print(x)

[[0 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0]
 [1 1 1 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 1 1 0 0 0]
 [0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

# get the feature (column) names
# remember the rows are our four documents!
bigram_vectorizer.get_feature_names()

['as',
 'as well',
 'document',
 'document four',
 'document text',
 'first',
 'first document',
 'four',
 'goes',
 'goes here',
 'goes in',
 'here',
 'here as',
 'in',
 'in this',
 'list',
 'list of',
 'of',
 'of strings',
 'second',
 'second document',
 'strings',
 'text',
 'text goes',
 'text text',
 'third',
 'third document',
 'this',
 'this list',
 'well']

# search for vocabulary words across the documents
feature_index = bigram_vectorizer.vocabulary_.get('first document')
x[:, feature_index]

array([1, 0, 0, 0])

Data frame vocabulary across documents¶

Our vocabulary consists of 29 unigrams and bigrams across the four documents.

# redefine x as sparse matrix (not array)
x = bigram_vectorizer.fit_transform(corpus)
print(x)

  (0, 5)	1
  (0, 2)	1
  (0, 22)	1
  (0, 8)	1
  (0, 13)	1
  (0, 27)	1
  (0, 15)	1
  (0, 17)	1
  (0, 21)	1
  (0, 6)	1
  (0, 4)	1
  (0, 23)	1
  (0, 10)	1
  (0, 14)	1
  (0, 28)	1
  (0, 16)	1
  (0, 18)	1
  (1, 2)	1
  (1, 22)	1
  (1, 8)	1
  (1, 4)	1
  (1, 23)	1
  (1, 19)	1
  (1, 11)	1
  (1, 0)	1
  (1, 29)	1
  (1, 20)	1
  (1, 9)	1
  (1, 12)	1
  (1, 1)	1
  (2, 2)	1
  (2, 22)	2
  (2, 4)	1
  (2, 25)	1
  (2, 26)	1
  (2, 24)	1
  (3, 2)	1
  (3, 7)	1
  (3, 3)	1

import scipy.sparse
corpus_df = pd.DataFrame(x.todense(), columns = bigram_vectorizer.get_feature_names())
corpus_df

	as	as well	document	document four	document text	first	first document	four	goes	goes here	...	second document	strings	text	text goes	text text	third	third document	this	this list	well
0	0	0	1	0	1	1	1	0	1	0	...	0	1	1	1	0	0	0	1	1	0
1	1	1	1	0	1	0	0	0	1	1	...	1	0	1	1	0	0	0	0	0	1
2	0	0	1	0	1	0	0	0	0	0	...	0	0	2	0	1	1	1	0	0	0
3	0	0	1	1	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0

4 rows × 30 columns

# The features (columns) are the vocabulary from the bigram version of our corpus variable above! 
corpus_df.columns

Index(['as', 'as well', 'document', 'document four', 'document text', 'first',
       'first document', 'four', 'goes', 'goes here', 'goes in', 'here',
       'here as', 'in', 'in this', 'list', 'list of', 'of', 'of strings',
       'second', 'second document', 'strings', 'text', 'text goes',
       'text text', 'third', 'third document', 'this', 'this list', 'well'],
      dtype='object')

Quiz - Classic horror novel vocabulary¶

Create a single sparse data frame that contains the vocabulary for Frankenstein and Dracula. You should only have two rows (one for each of these novels), but potentially thousands of columns to represent the vocabulary across the two texts.

	as	as well	document	document four	document text	first	first document	four	goes	goes here	...	second document	strings	text	text goes	text text	third	third document	this	this list	well
0	0	0	1	0	1	1	1	0	1	0	...	0	1	1	1	0	0	0	1	1	0
1	1	1	1	0	1	0	0	0	1	1	...	1	0	1	1	0	0	0	0	0	1
2	0	0	1	0	1	0	0	0	0	0	...	0	0	2	0	1	1	1	0	0	0
3	0	0	1	1	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0

	as	as well	document	document four	document text	first	first document	four	goes	goes here	...	second document	strings	text	text goes	text text	third	third document	this	this list	well
0	0	0	1	0	1	1	1	0	1	0	...	0	1	1	1	0	0	0	1	1	0
1	1	1	1	0	1	0	0	0	1	1	...	1	0	1	1	0	0	0	0	0	1
2	0	0	1	0	1	0	0	0	0	0	...	0	0	2	0	1	1	1	0	0	0
3	0	0	1	1	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0

Text Analysis and Machine Learning (TAML) Group