# Chapter 7 - English text preprocessing basics - and applications
2022 August 26

<a target="_blank" href="https://colab.research.google.com/github/EastBayEv/SSDS-TAML/blob/main/fall2022/7_English_text_preprocessing_basics_applications.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

![text](img/text.png)

Unstructured text - text you find in the wild in books and websites - is generally not amenable to analysis. Before it can be analyzed, the text needs to be standardized to a format so that Python can recognize each unit of meaning **(called a "token")** as unique, no matter how many times it occurs and how it is stylized. 

Although not an exhaustive list, some key steps in preprocessing text include:  
* Standardize text casing and spacing 
* Remove punctuation and special characters/symbols
* Remove stop words
* Stem or lemmatize: convert all non-base words to their base form 

Stemming/lemmatization and stop words (and some punctuation) are language-specific. The Natural Language ToolKit (NLTK) works for English out-of-the-box, but you'll need different code to work with other languages. Some languages (e.g. Chinese) also require *segmentation*: artificially inserting spaces between words. If you want to do text preprocessing for other languages, please let us know and we can help!
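
As a rough illustration of the first three steps in the list above, here is a minimal sketch that uses only the Python standard library. The names `TINY_STOP_WORDS` and `simple_preprocess` are hypothetical and exist only for this example; NLTK's English stop word list is much longer, and stemming/lemmatization is left out entirely.

```python
import string

# Hypothetical, deliberately tiny stop word list (illustration only)
TINY_STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def simple_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    # Standardize casing
    text = text.lower()
    # Replace every punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # split() with no arguments also collapses runs of whitespace
    tokens = text.split()
    return [token for token in tokens if token not in TINY_STOP_WORDS]

tokens = simple_preprocess("The Council, and the STATE under review, is in session.")
print(tokens)
```

After these steps, "STATE" and "state" count as the same token regardless of casing or adjacent punctuation.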

In [1]:
# Ensure you have the proper nltk modules
import nltk
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('omw-1.4')

[nltk_data] Downloading package words to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
import os
import warnings
from collections import Counter
from string import punctuation

import nltk
import numpy as np
import pandas as pd
import regex as re
import seaborn as sns
import spacy
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle

warnings.filterwarnings("ignore", category = DeprecationWarning)

## Corpus definition: United Nations Human Rights Council Documentation

![unhrc](img/unhrc.jpg)

We will use eleven .txt files from the UN HRC as our corpus, stored in the "human_rights" subfolder of the main "data" directory. 

These documents contain human rights recommendations that member nations made to the country under review as part of the Universal Periodic Review. 

[Learn more about the UN HRC by clicking here.](https://www.ohchr.org/en/hrbodies/hrc/pages/home.aspx)

### Define the corpus directory

Set the directory's file path and print the files it contains.

In [23]:
# Make the directory "human_rights" inside of data
!mkdir data
!mkdir data/human_rights

mkdir: data: File exists
mkdir: data/human_rights: File exists


In [24]:
# If your "data" folder already exists in Colab and you want to delete it, type:
# !rm -r data

# If the "human_rights" folder already exists in Colab and you want to delete it, type:
# !rm -r data/human_rights

In [25]:
# Download eleven UN HRC files
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/afghanistan2014.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/bangladesh2013.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/cotedivoire2014.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/djibouti2013.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/fiji2014.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/jordan2013.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/kazakhstan2014.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/monaco2013.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/sanmarino2014.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/turkmenistan2013.txt
# !wget -P data/human_rights/ https://raw.githubusercontent.com/EastBayEv/SSDS-TAML/main/fall2022/data/human_rights/tuvalu2013.txt

In [26]:
# Check that we have eleven files, one for each country
!ls data/human_rights/

afghanistan2014.txt  fiji2014.txt         sanmarino2014.txt
bangladesh2013.txt   jordan2013.txt       turkmenistan2013.txt
cotedivoire2014.txt  kazakhstan2014.txt   tuvalu2013.txt
djibouti2013.txt     monaco2013.txt


In [27]:
import os
corpus = os.listdir('data/human_rights/')

# View the contents of this directory
corpus

['sanmarino2014.txt',
 'tuvalu2013.txt',
 'kazakhstan2014.txt',
 'cotedivoire2014.txt',
 'fiji2014.txt',
 'bangladesh2013.txt',
 'turkmenistan2013.txt',
 'jordan2013.txt',
 'monaco2013.txt',
 'afghanistan2014.txt',
 'djibouti2013.txt']

### Store these documents in a data frame

In [28]:
# Store in an empty dictionary for conversion to data frame
empty_dictionary = {}

# Loop through the folder of documents to open and read each one
for document in corpus:
    with open('data/human_rights/' + document, 'r', encoding = 'utf-8') as to_open:
         empty_dictionary[document] = to_open.read()

# Populate the data frame with two columns: file name and document text
human_rights = (pd.DataFrame.from_dict(empty_dictionary, 
                                       orient = 'index')
                .reset_index().rename(index = str, 
                                      columns = {'index': 'file_name', 0: 'document_text'}))
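
`os.listdir()` returns names in arbitrary order (as the listing above shows) and includes every file in the folder, not just text files. The sketch below is one possible alternative using `pathlib`; `read_corpus` is a hypothetical helper name, and the demonstration runs on a throwaway directory with made-up files so that it works without the HRC downloads.

```python
import tempfile
from pathlib import Path

def read_corpus(directory):
    """Return {file_name: text} for every .txt file, in sorted order."""
    documents = {}
    for path in sorted(Path(directory).glob("*.txt")):
        documents[path.name] = path.read_text(encoding="utf-8")
    return documents

# Demonstrate on a temporary directory with two made-up documents
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b_country2014.txt").write_text("second document", encoding="utf-8")
    (Path(tmp) / "a_country2013.txt").write_text("first document", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("not part of the corpus", encoding="utf-8")
    demo = read_corpus(tmp)

print(list(demo))
```

The resulting dictionary has the same shape as `empty_dictionary`, so it could be passed to the same `pd.DataFrame.from_dict()` call.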

### View the data frame

In [29]:
human_rights

Unnamed: 0,file_name,document_text
0,sanmarino2014.txt,\n United Nations \n A/HRC/28/9 \n \n \n\n Ge...
1,tuvalu2013.txt,\n United Nations \n A/HRC/24/8 \n \n \n\n G...
2,kazakhstan2014.txt,\n United Nations \n A/HRC/28/10 \n \n \n\n G...
3,cotedivoire2014.txt,\nDistr.: General 7 July 2014 English Original...
4,fiji2014.txt,\n United Nations \n A/HRC/28/8 \n \n \n\n Ge...
5,bangladesh2013.txt,\n United Nations \n A/HRC/24/12 \n \n \n\n ...
6,turkmenistan2013.txt,\n United Nations \n A/HRC/24/3 \n \n \n\n G...
7,jordan2013.txt,\nDistr.: General 6 January 2014 \nOriginal: E...
8,monaco2013.txt,\nDistr.: General 3 January 2014 English Origi...
9,afghanistan2014.txt,\nDistr.: General 4 April 2014 \nOriginal: Eng...


### View the text of the first document

In [30]:
# first thousand characters
print(human_rights['document_text'][0][:1000])

 
 United Nations 
 A/HRC/28/9 
 
 

 General Assembly 
 Distr.: General 
24 December 2014 
 
Original: English 
 

Human Rights Council 

Twenty-eighth session 
Agenda item 6 
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review* 
 * The annex to the present report is circulated as received. 
  San Marino 
Contents 
 Paragraphs Page 
  Introduction .............................................................................................................  1Ð4 3 
 I. Summary of the proceedings of the review process ................................................  5Ð77 3 
  A. Presentation by the State under review ...........................................................  5Ð23 3 
  B. Interactive dialogue and responses by the State under review ........................  24Ð77 6 
 II. Conclusions and/or recommendations .....................................................................  78Ð81 13 
 Annex 
  Composition of the delegation .......

## English text preprocessing

Create a new column named "clean_text" to store the text as it is preprocessed. 

### What are some of the things we can do? 

These are just a few examples. How else could you improve this process? 

* Remove non-alphanumeric characters/punctuation
* Remove digits
* Remove non-ASCII [unicode characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)
* Remove extra spaces
* Convert to lowercase
* Lemmatize (optional for now)

Take a look at the first document after each step to see if you can notice what changed. 

> Remember: the process will likely be different for many other natural languages, which frequently require special considerations. 

### Remove non-alphanumeric characters/punctuation

In [31]:
# Create a new column 'clean_text' to store the text we are standardizing
human_rights['clean_text'] = human_rights['document_text'].str.replace(r'[^\w\s]', ' ', regex = True)
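
The pattern `[^\w\s]` matches any single character that is neither a word character (letter, digit, or underscore) nor whitespace. Here is a small sketch of its effect on a made-up string, using the standard library `re` module, which should behave the same as `regex` for this pattern:

```python
import re

# Made-up sample that mixes slashes, a period, a colon, and an underscore
sample = "A/HRC/28/9 Distr.: General snake_case"

# Replace each character that is not a word character or whitespace with a space
cleaned = re.sub(r"[^\w\s]", " ", sample)
print(cleaned)
```

Note that the underscore survives because `\w` includes it, and that each removed character leaves a space behind, which is why extra spaces need to be cleaned up in a later step.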

In [32]:
print(human_rights['clean_text'][0][:1000])

 
 United Nations 
 A HRC 28 9 
 
 

 General Assembly 
 Distr   General 
24 December 2014 
 
Original  English 
 

Human Rights Council 

Twenty eighth session 
Agenda item 6 
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review  
   The annex to the present report is circulated as received  
  San Marino 
Contents 
 Paragraphs Page 
  Introduction                                                                                                                1Ð4 3 
 I  Summary of the proceedings of the review process                                                   5Ð77 3 
  A  Presentation by the State under review                                                              5Ð23 3 
  B  Interactive dialogue and responses by the State under review                           24Ð77 6 
 II  Conclusions and or recommendations                                                                        78Ð81 13 
 Annex 
  Composition of the delegation        

In [33]:
# view third column
human_rights

Unnamed: 0,file_name,document_text,clean_text
0,sanmarino2014.txt,\n United Nations \n A/HRC/28/9 \n \n \n\n Ge...,\n United Nations \n A HRC 28 9 \n \n \n\n Ge...
1,tuvalu2013.txt,\n United Nations \n A/HRC/24/8 \n \n \n\n G...,\n United Nations \n A HRC 24 8 \n \n \n\n G...
2,kazakhstan2014.txt,\n United Nations \n A/HRC/28/10 \n \n \n\n G...,\n United Nations \n A HRC 28 10 \n \n \n\n G...
3,cotedivoire2014.txt,\nDistr.: General 7 July 2014 English Original...,\nDistr General 7 July 2014 English Original...
4,fiji2014.txt,\n United Nations \n A/HRC/28/8 \n \n \n\n Ge...,\n United Nations \n A HRC 28 8 \n \n \n\n Ge...
5,bangladesh2013.txt,\n United Nations \n A/HRC/24/12 \n \n \n\n ...,\n United Nations \n A HRC 24 12 \n \n \n\n ...
6,turkmenistan2013.txt,\n United Nations \n A/HRC/24/3 \n \n \n\n G...,\n United Nations \n A HRC 24 3 \n \n \n\n G...
7,jordan2013.txt,\nDistr.: General 6 January 2014 \nOriginal: E...,\nDistr General 6 January 2014 \nOriginal E...
8,monaco2013.txt,\nDistr.: General 3 January 2014 English Origi...,\nDistr General 3 January 2014 English Origi...
9,afghanistan2014.txt,\nDistr.: General 4 April 2014 \nOriginal: Eng...,\nDistr General 4 April 2014 \nOriginal Eng...


### Remove digits

In [34]:
human_rights['clean_text'] = human_rights['clean_text'].str.replace(r'\d', ' ', regex = True)

In [35]:
print(human_rights['clean_text'][0][:1000])

 
 United Nations 
 A HRC      
 
 

 General Assembly 
 Distr   General 
   December      
 
Original  English 
 

Human Rights Council 

Twenty eighth session 
Agenda item   
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review  
   The annex to the present report is circulated as received  
  San Marino 
Contents 
 Paragraphs Page 
  Introduction                                                                                                                 Ð    
 I  Summary of the proceedings of the review process                                                    Ð     
  A  Presentation by the State under review                                                               Ð     
  B  Interactive dialogue and responses by the State under review                             Ð     
 II  Conclusions and or recommendations                                                                          Ð      
 Annex 
  Composition of the delegation        

### Remove non-ASCII characters such as Ð and ð

In [36]:
# for more on text encodings: https://www.w3.org/International/questions/qa-what-is-encoding
human_rights['clean_text'] = human_rights['clean_text'].str.encode('ascii', 'ignore').str.decode('ascii')
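
Encoding to ASCII with `'ignore'` silently drops every character outside the ASCII range, including accented letters inside words. The sketch below compares that approach with one possible alternative that decomposes accented letters first via `unicodedata.normalize`; the function names and the sample string are made up for illustration.

```python
import unicodedata

def drop_non_ascii(text):
    """Same idea as .str.encode('ascii', 'ignore').str.decode('ascii')."""
    return text.encode("ascii", "ignore").decode("ascii")

def fold_to_ascii(text):
    """Decompose accented letters first so that their base letters survive."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# "Côte d'Ivoire 1Ð4 président", written with escapes to avoid encoding ambiguity
sample = "C\u00f4te d'Ivoire 1\u00d04 pr\u00e9sident"

print(drop_non_ascii(sample))
print(fold_to_ascii(sample))
```

The first function turns "président" into "prsident", while the second keeps "president". Both drop Ð, which has no ASCII base letter.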

In [37]:
print(human_rights['clean_text'][0][:1000])

 
 United Nations 
 A HRC      
 
 

 General Assembly 
 Distr   General 
   December      
 
Original  English 
 

Human Rights Council 

Twenty eighth session 
Agenda item   
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review  
   The annex to the present report is circulated as received  
  San Marino 
Contents 
 Paragraphs Page 
  Introduction                                                                                                                     
 I  Summary of the proceedings of the review process                                                         
  A  Presentation by the State under review                                                                    
  B  Interactive dialogue and responses by the State under review                                  
 II  Conclusions and or recommendations                                                                                
 Annex 
  Composition of the delegation             

### Remove extra spaces

In [38]:
import regex as re
human_rights['clean_text'] = human_rights['clean_text'].str.replace(r'\s+', ' ', regex = True)

In [39]:
print(human_rights['clean_text'][0][:1000])

 United Nations A HRC General Assembly Distr General December Original English Human Rights Council Twenty eighth session Agenda item Universal Periodic Review Report of the Working Group on the Universal Periodic Review The annex to the present report is circulated as received San Marino Contents Paragraphs Page Introduction I Summary of the proceedings of the review process A Presentation by the State under review B Interactive dialogue and responses by the State under review II Conclusions and or recommendations Annex Composition of the delegation Introduction The Working Group on the Universal Periodic Review established in accordance with Human Rights Council resolution of June held its twentieth session from October to November The review of San Marino was held at the th meeting on October The delegation of San Marino was headed by Pasquale Valentini Minister for Foreign Affairs At its th meeting held on October the Working Group adopted the report on San Marino On January the Hu

### Convert to lowercase

In [40]:
human_rights['clean_text'] = human_rights['clean_text'].str.lower()

In [41]:
print(human_rights['clean_text'][0][:1000])

 united nations a hrc general assembly distr general december original english human rights council twenty eighth session agenda item universal periodic review report of the working group on the universal periodic review the annex to the present report is circulated as received san marino contents paragraphs page introduction i summary of the proceedings of the review process a presentation by the state under review b interactive dialogue and responses by the state under review ii conclusions and or recommendations annex composition of the delegation introduction the working group on the universal periodic review established in accordance with human rights council resolution of june held its twentieth session from october to november the review of san marino was held at the th meeting on october the delegation of san marino was headed by pasquale valentini minister for foreign affairs at its th meeting held on october the working group adopted the report on san marino on january the hu

### Lemmatize

In [42]:
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_lg

In [43]:
nlp = spacy.load('en_core_web_sm')
human_rights['clean_text'] = human_rights['clean_text'].apply(lambda row: ' '.join([w.lemma_ for w in nlp(row)]))

In [44]:
print(human_rights['clean_text'][0][:1000])

  united nations a hrc general assembly distr general december original english human rights council twenty eighth session agenda item universal periodic review report of the work group on the universal periodic review the annex to the present report be circulate as receive san marino content paragraph page introduction I summary of the proceeding of the review process a presentation by the state under review b interactive dialogue and response by the state under review ii conclusion and or recommendation annex composition of the delegation introduction the work group on the universal periodic review establish in accordance with human rights council resolution of june hold its twentieth session from october to november the review of san marino be hold at the th meeting on october the delegation of san marino be head by pasquale valentini minister for foreign affair at its th meeting hold on october the working group adopt the report on san marino on january the human rights council sel

### View the updated data frame

In [45]:
human_rights

Unnamed: 0,file_name,document_text,clean_text
0,sanmarino2014.txt,\n United Nations \n A/HRC/28/9 \n \n \n\n Ge...,united nations a hrc general assembly distr ...
1,tuvalu2013.txt,\n United Nations \n A/HRC/24/8 \n \n \n\n G...,united nations a hrc general assembly distr ...
2,kazakhstan2014.txt,\n United Nations \n A/HRC/28/10 \n \n \n\n G...,united nations a hrc general assembly distr ...
3,cotedivoire2014.txt,\nDistr.: General 7 July 2014 English Original...,distr general july english original english ...
4,fiji2014.txt,\n United Nations \n A/HRC/28/8 \n \n \n\n Ge...,united nations a hrc general assembly distr ...
5,bangladesh2013.txt,\n United Nations \n A/HRC/24/12 \n \n \n\n ...,united nations a hrc general assembly distr ...
6,turkmenistan2013.txt,\n United Nations \n A/HRC/24/3 \n \n \n\n G...,united nations a hrc general assembly distr ...
7,jordan2013.txt,\nDistr.: General 6 January 2014 \nOriginal: E...,distr general january original english gener...
8,monaco2013.txt,\nDistr.: General 3 January 2014 English Origi...,distr general january english original engli...
9,afghanistan2014.txt,\nDistr.: General 4 April 2014 \nOriginal: Eng...,distr general april original english general...


## Exercises - redwoods webscraping

This also works with data scraped from the web. Below is a very brief BeautifulSoup example that saves the contents of the Sequoioideae (redwood trees) Wikipedia page in a variable named `text`. 

1. Read through the code below
2. Practice by repeating for a webpage of your choice

![redwood](img/redwood.png)

In [68]:
# import necessary libraries
from bs4 import BeautifulSoup
import requests
import regex as re
import nltk

## Three variables will get you started

1. `url` - define the URL to be scraped 
2. `response` - perform the get request on the URL 
3. `soup` - create the soup object so we can parse the html 

In [69]:
url = "https://en.wikipedia.org/wiki/Sequoioideae"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

## Get the text

HTML (hypertext markup language) is used to structure a webpage and the content it contains, including text.

Below is a handy for loop that finds everything within paragraph (`<p>`) tags. 

In [70]:
# save in an empty string
text = ""

for paragraph in soup.find_all('p'):
    text += paragraph.text
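
To see what `find_all('p')` keeps and what it skips, here is a small sketch that parses a made-up HTML string instead of a live page. The variable names are hypothetical. Joining the paragraphs with a space is a small variation on the loop above, which concatenates them directly and can run words together when a paragraph does not end in whitespace.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<html><body>
  <h1>Redwoods</h1>
  <p>Redwoods are tall.<sup>[1]</sup></p>
  <div>Navigation text that is not inside a paragraph tag.</div>
  <p>They can live for thousands of years.</p>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Collect the text of each <p> tag, including text in nested tags such as <sup>
paragraphs = [p.get_text() for p in demo_soup.find_all("p")]
page_text = " ".join(paragraphs)
print(page_text)
```

The heading and the `<div>` are skipped, while the citation marker nested inside the first paragraph is kept.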

In [71]:
print(text)


Sequoioideae, popularly known as redwoods, is a subfamily of coniferous trees within the family Cupressaceae. It includes the largest and tallest trees in the world.
The three redwood subfamily genera are Sequoia from coastal California and Oregon, Sequoiadendron from California's Sierra Nevada, and Metasequoia in China. The redwood species contains the largest and tallest trees in the world. These trees can live for thousands of years. Threats include logging, fire suppression,[2] climate change, illegal marijuana cultivation, and burl poaching.[3][4][5]
Only two of the genera, Sequoia and Sequoiadendron, are known for massive trees. Trees of Metasequoia, from the single living species Metasequoia glyptostroboides, are much smaller.
Multiple studies of both morphological and molecular characters have strongly supported the assertion that the Sequoioideae are monophyletic.[6][7][8][9]
Most modern phylogenies place Sequoia as sister to Sequoiadendron and Metasequoia as the out-group.[7

## Regular expressions

Regular expressions are sequences of characters and symbols that represent search patterns in text - and are generally quite useful. 

[Check out the tutorial](https://docs.python.org/3/library/re.html) and [cheatsheet](https://www.dataquest.io/blog/regex-cheatsheet/) to find out what the symbols below mean and to write your own code. Better yet, in some cases you could write a single pattern that performs several of these steps at once, in fewer lines of code!

In [72]:
text = re.sub(r'\[[0-9]*\]',' ',text)
text = re.sub(r'\s+',' ',text)
text = re.sub(r'\d',' ',text)
text = re.sub(r'[^\w\s]','',text)
text = text.lower()
text = re.sub(r'\s+',' ',text)
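
To make the purpose of each substitution easier to see, here is a condensed sketch of the same steps applied to a short made-up sentence. `clean_scraped_text` is a hypothetical helper name. It uses the standard library `re` module, collapses whitespace only once at the end, and adds a final `strip()`.

```python
import re

def clean_scraped_text(raw):
    """Condensed variant of the substitutions above, with one comment per step."""
    cleaned = re.sub(r"\[[0-9]*\]", " ", raw)   # citation markers such as [2]
    cleaned = re.sub(r"\d", " ", cleaned)       # any remaining digits
    cleaned = re.sub(r"[^\w\s]", "", cleaned)   # punctuation and symbols
    cleaned = cleaned.lower()                   # standardize casing
    cleaned = re.sub(r"\s+", " ", cleaned)      # collapse runs of whitespace
    return cleaned.strip()

sample = "Threats include logging, fire suppression,[2] and burl poaching.[3][4] The 3 genera differ."
result = clean_scraped_text(sample)
print(result)
```

Order matters here: removing the citation markers first prevents the digit step from leaving empty brackets behind.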

In [74]:
# print(text)

## Unsupervised learning with `TfidfVectorizer()`

Remember `CountVectorizer()` for creating bag-of-words models? `TfidfVectorizer()` extends the idea of counting words: it weights each term by how often it occurs within a document relative to how common it is across the rest of the corpus. Each row of the document-term matrix is still a document and each column is still a linguistic feature, but the cells now hold term-uniqueness weights instead of raw frequencies. Remember that: 

* For TF-IDF sparse matrices:
    * A value closer to 1 indicates that a feature is more relevant to a particular document.
    * A value closer to 0 indicates that a feature is less/not relevant to that document.

![tf1](img/tf1.png)

[Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

![tf2](img/tf2.png)

[towardsdatascience](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558)
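
To make the weighting concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, so the numbers may differ slightly from textbook formulas. `tfidf_rows` is a hypothetical helper name, not part of any library.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        # Guard against division by zero for an empty document
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

docs = ["human rights council", "human rights review", "monaco review monaco"]
vocabulary, rows = tfidf_rows(docs)
for term, weight in zip(vocabulary, rows[2]):
    print(f"{term:10s} {weight:.3f}")
```

In the third document, "monaco" receives a higher weight than "review" because it occurs twice and appears in no other document. The row normalization is what keeps every weight between 0 and 1.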

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(ngram_range = (1, 3), 
                                stop_words = 'english', 
                                max_df = 0.50
                                )
tf_sparse = tf_vectorizer.fit_transform(human_rights['clean_text'])
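
Two of the arguments above deserve a closer look: `ngram_range = (1, 3)` asks for single words plus two- and three-word sequences, and `max_df = 0.50` discards terms that appear in more than half of the documents. The pure-Python sketch below imitates both ideas on made-up data; `word_ngrams` and `too_common` are hypothetical helper names, not scikit-learn functions.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

def too_common(term, tokenized_docs, max_df=0.50):
    """True if a single-word term occurs in more than max_df of the documents."""
    containing = sum(term in doc for doc in tokenized_docs)
    return containing / len(tokenized_docs) > max_df

grams = word_ngrams("elimination violence woman".split())
print(grams)

# Made-up tokenized corpus of three documents
demo_docs = [["human", "rights"], ["human", "review"], ["monaco"]]
print(too_common("human", demo_docs), too_common("monaco", demo_docs))
```

Three words yield six features, which helps explain why the matrix above has tens of thousands of columns for only eleven documents.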

In [47]:
tf_sparse.shape

(11, 84181)

In [48]:
print(tf_sparse)

  (0, 39182)	0.004856846879037771
  (0, 31574)	0.004856846879037771
  (0, 50165)	0.004856846879037771
  (0, 79743)	0.004856846879037771
  (0, 46164)	0.004856846879037771
  (0, 70574)	0.004856846879037771
  (0, 67048)	0.004856846879037771
  (0, 48393)	0.004856846879037771
  (0, 55413)	0.004856846879037771
  (0, 5657)	0.004856846879037771
  (0, 2036)	0.004856846879037771
  (0, 18508)	0.004856846879037771
  (0, 4238)	0.004856846879037771
  (0, 49342)	0.004856846879037771
  (0, 2719)	0.004856846879037771
  (0, 39331)	0.004856846879037771
  (0, 7341)	0.004856846879037771
  (0, 80381)	0.004856846879037771
  (0, 49382)	0.004856846879037771
  (0, 2723)	0.004856846879037771
  (0, 43326)	0.004856846879037771
  (0, 20394)	0.004856846879037771
  (0, 27591)	0.004856846879037771
  (0, 53796)	0.004856846879037771
  (0, 74877)	0.004856846879037771
  :	:
  (10, 23861)	0.004813404453682774
  (10, 53868)	0.005386104036626412
  (10, 61809)	0.005386104036626412
  (10, 11732)	0.006124442213746767
  (10, 522

### Convert the tfidf sparse matrix to data frame

In [49]:
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
tfidf_df = pd.DataFrame(tf_sparse.todense(), columns = tf_vectorizer.get_feature_names_out())
tfidf_df



Unnamed: 0,abasi,abasi desk,abasi desk officer,abdi,abdi ismael,abdi ismael hersi,abdou,abdou prsident,abdou prsident la,abduction,...,zone,zone inclusive,zone inclusive education,zone senegal,zone senegal make,zone social,zone social benefit,zouon,zouon bi,zouon bi tidou
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004509,...,0.011473,0.006711,0.006711,0.0,0.0,0.006711,0.006711,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006249,0.006249,0.006249
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011198,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.006296,0.0,0.0,0.007365,0.007365,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004721,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.007608,0.007608,0.007608,0.0,0.0,0.0,0.0,0.0,0.0,0.005111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### View 20 highest weighted words

In [50]:
tfidf_df.max().sort_values(ascending = False).head(n = 20)

monaco                        0.720348
tuvalu                        0.639251
kazakhstan                    0.615483
fiji                          0.582410
turkmenistan                  0.578904
san                           0.553681
jordan                        0.491254
san marino                    0.456544
marino                        0.456544
divoire                       0.340046
te divoire                    0.340046
te                            0.306989
elimination violence          0.253620
elimination violence woman    0.253620
djiboutis                     0.250777
reconciliation                0.245711
fgm                           0.195982
afghan                        0.190201
bangladeshs                   0.183356
violence woman law            0.182593
dtype: float64

### Add country name to `tfidf_df`

This way, we will know which document relates to which country.

In [51]:
# wrangle the country names from the human_rights data frame
countries = human_rights['file_name'].str.slice(stop = -8)
countries = list(countries)
countries

['sanmarino',
 'tuvalu',
 'kazakhstan',
 'cotedivoire',
 'fiji',
 'bangladesh',
 'turkmenistan',
 'jordan',
 'monaco',
 'afghanistan',
 'djibouti']
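
The slice `stop = -8` works because every file name ends in a four-digit year plus `.txt`, which is eight characters. Here is a short sketch, on a few of the file names, comparing the slice with a pattern-based alternative that states that assumption explicitly:

```python
import re

file_names = ["sanmarino2014.txt", "cotedivoire2014.txt", "tuvalu2013.txt"]

# Fixed-width slice, as in the cell above: drop the last 8 characters
by_slice = [name[:-8] for name in file_names]

# Pattern-based alternative: strip a trailing four-digit year plus ".txt"
by_pattern = [re.sub(r"\d{4}\.txt$", "", name) for name in file_names]

print(by_slice)
print(by_pattern)
```

Both give the same result here. The pattern version leaves a name unchanged if it does not follow the expected format, whereas the slice would silently cut off its last eight characters.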

In [52]:
tfidf_df['COUNTRY'] = countries

In [53]:
tfidf_df

Unnamed: 0,abasi,abasi desk,abasi desk officer,abdi,abdi ismael,abdi ismael hersi,abdou,abdou prsident,abdou prsident la,abduction,...,zone inclusive,zone inclusive education,zone senegal,zone senegal make,zone social,zone social benefit,zouon,zouon bi,zouon bi tidou,COUNTRY
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,sanmarino
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,tuvalu
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004509,...,0.006711,0.006711,0.0,0.0,0.006711,0.006711,0.0,0.0,0.0,kazakhstan
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.006249,0.006249,0.006249,cotedivoire
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,fiji
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011198,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,bangladesh
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.007365,0.007365,0.0,0.0,0.0,0.0,0.0,turkmenistan
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004721,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,jordan
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,monaco
9,0.007608,0.007608,0.007608,0.0,0.0,0.0,0.0,0.0,0.0,0.005111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,afghanistan


### Examine unique words for each document/country

Change the country name to view that country's highest-weighted terms.

In [54]:
country = tfidf_df[tfidf_df['COUNTRY'] == 'jordan']
country.max(numeric_only = True).sort_values(ascending = False).head(20)

jordan                                0.491254
jordanian                             0.140540
press publication                     0.112432
syrian                                0.105405
reservation                           0.099133
publication law                       0.091351
press publication law                 0.091351
constitutional amendment              0.089799
syrian refugee                        0.084324
publication                           0.080973
reservation convention                0.078083
reservation convention elimination    0.072077
website                               0.072077
commitment jordan                     0.063243
news website                          0.056216
news                                  0.054058
al                                    0.054058
personal status                       0.054058
personal                              0.052823
host                                  0.046879
dtype: float64

## UN HRC text analysis - what next?

Keep in mind that we have not even begun to consider named entities and parts of speech. What problems immediately jump out from the above examples, such as the number and uniqueness of country names?

Chapters 8 and 9 introduce powerful text preprocessing and analysis techniques. Read ahead to see how we can handle roadblocks such as these. 

## Sentiment analysis

Sentiment analysis uses the context of text data to extract subjective information from source materials and classify it as positive, negative, or neutral. 

![sa](img/sa.jpg)

[Repustate](https://www.repustate.com/blog/sentiment-analysis-challenges-with-solutions/)

### Download the nltk built-in movie reviews dataset

In [55]:
import nltk
from nltk.corpus import movie_reviews
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

### Define x (reviews) and y (judgements) variables

In [56]:
# Extract our x (reviews) and y (judgements) variables
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
judgements = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]

In [57]:
# Save in a dataframe
movies = pd.DataFrame({"Reviews" : reviews, 
                      "Judgements" : judgements})
movies.head()

Unnamed: 0,Reviews,Judgements
0,"plot : two teen couples go to a church party ,...",neg
1,the happy bastard's quick movie review \ndamn ...,neg
2,it is movies like these that make a jaded movi...,neg
3,""" quest for camelot "" is warner bros . ' firs...",neg
4,synopsis : a mentally unstable man undergoing ...,neg


In [58]:
movies.shape

(2000, 2)

### Shuffle the reviews

In [59]:
import numpy as np
from sklearn.utils import shuffle
x, y = shuffle(np.array(movies.Reviews), np.array(movies.Judgements), random_state = 1)
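
Passing both arrays to `shuffle` in a single call matters: the same random order is applied to each, so every review stays paired with its own label. A minimal sketch with toy arrays (the `toy_x` and `toy_y` names and values are made up for illustration):

```python
import numpy as np
from sklearn.utils import shuffle

# Toy stand-ins for the reviews and judgements arrays
toy_x = np.array(["review a", "review b", "review c", "review d"])
toy_y = np.array(["label a", "label b", "label c", "label d"])

# Both arrays are permuted with the same random order
shuffled_x, shuffled_y = shuffle(toy_x, toy_y, random_state=1)

for text, label in zip(shuffled_x, shuffled_y):
    print(text, "->", label)
```

Shuffling the two arrays in separate calls without a shared `random_state` would break this pairing.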

In [60]:
# change x[0] and y[0] to see different reviews
x[0], print("Human review was:", y[0])

Human review was: neg


('steve martin is one of the funniest men alive . \nif you can take that as a true statement , then your disappointment at this film will equal mine . \nmartin can be hilarious , creating some of the best laugh-out-loud experiences that have ever taken place in movie theaters . \nyou won\'t find any of them here . \nthe old television series that this is based on has its moments of humor and wit . \nbilko ( and the name isn\'t an accident ) is the head of an army motor pool group , but his passion is his schemes . \nevery episode involves the sergeant and his men in one or another hair-brained plan to get rich quick while outwitting the officers of the base . \n " mchale\'s navy " \'s granddaddy . \nthat\'s the idea behind this movie too , but the difference is that , as far-fetched and usually goofy as the television series was , it was funny . \nthere is not one laugh in the film . \nthe re-make retains the goofiness , but not the entertainment . \neverything is just too clean . \nit

### Pipelines - one example

scikit-learn offers handy ways to build machine learning pipelines: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [61]:
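from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# standard training/test split (no cross validation)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 0)

# get tfidf values
# note: fitting on all of x lets test-set vocabulary and document frequencies
# inform the features; fit on x_train only for a stricter evaluation
tfidf = TfidfVectorizer()
tfidf.fit(x)
x_train = tfidf.transform(x_train)
x_test = tfidf.transform(x_test)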

# instantiate and train a logistic regression model
logit_class = LogisticRegression(solver = 'liblinear',
                                 penalty = 'l2', 
                                 C = 1000, 
                                 random_state = 1)
model = logit_class.fit(x_train, y_train)

In [62]:
# test set accuracy
model.score(x_test, y_test)

0.8216666666666667
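
The cell above calls `tfidf.fit(x)` on the full dataset before scoring on the test split, so the vocabulary and document frequencies of the test reviews feed into the features. One way to avoid this, sketched below under stated assumptions, is to wrap the vectorizer and the classifier in a single `Pipeline` so that both are fit on the training documents only. The toy corpus and the names `toy_docs`, `toy_labels`, and `toy_pipe` are made up for illustration; this is not the code that produced the accuracy above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration): 1 = positive, 0 = negative
toy_docs = ["great fun film", "excellent and fun", "great cast", "wonderful story",
            "boring bad film", "awful and dull", "bad script", "worst plot"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

docs_train, docs_test, lab_train, lab_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0)

# The pipeline fits the vectorizer and the classifier on the training split only
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
toy_pipe.fit(docs_train, lab_train)

vocab = toy_pipe.named_steps["tfidf"].vocabulary_
print(len(vocab), "terms learned from the training documents only")
print("Test accuracy:", toy_pipe.score(docs_test, lab_test))
```

Calling `toy_pipe.score` or `toy_pipe.predict` on raw strings applies the already-fitted vectorizer first, so no separate `transform` step is needed.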

### $k$-fold cross-validated model

In [63]:
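from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Cross-validated model!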
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(solver = 'liblinear',
                                               penalty = 'l2', 
                                               C = 1000, 
                                               random_state = 1))
                     ])

# for your own research, thesis, or publication
# you would select cv equal to 10 or 20
scores = cross_val_score(text_clf, x, y, cv = 3)

print(scores, np.mean(scores))

[0.8155922  0.79910045 0.80630631] 0.8069996533264899
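
When `cv` is an integer and the estimator is a classifier, `cross_val_score` splits the data with `StratifiedKFold`, so each fold keeps roughly the same class proportions. A minimal sketch of what those splits look like, using toy labels (the `toy_x` and `toy_y` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

toy_y = np.array(["pos", "pos", "pos", "neg", "neg", "neg"])
toy_x = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

# Three folds: each sample is held out for testing exactly once
skf = StratifiedKFold(n_splits=3)
folds = list(skf.split(toy_x, toy_y))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each score printed by `cross_val_score` comes from training on one fold's training indices and evaluating on its held-out test indices.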


### Top 25 features for positive and negative reviews

In [64]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
top25pos = np.argsort(model.coef_[0])[-25:]
print("Top features for positive reviews:")
print(list(feature_names[j] for j in top25pos))
print()
print("Top features for negative reviews:")
top25neg = np.argsort(model.coef_[0])[:25]
print(list(feature_names[j] for j in top25neg))

Top features for positive reviews:
['gas', 'perfectly', 'family', 'political', 'will', 'seen', 'rocky', 'always', 'different', 'excellent', 'also', 'many', 'is', 'matrix', 'trek', 'well', 'definitely', 'truman', 'very', 'great', 'quite', 'fun', 'jackie', 'as', 'and']

Top features for negative reviews:
['bad', 'only', 'plot', 'worst', 'there', 'boring', 'script', 'why', 'have', 'unfortunately', 'dull', 'poor', 'any', 'waste', 'nothing', 'looks', 'ridiculous', 'supposed', 'no', 'even', 'harry', 'awful', 'then', 'reason', 'wasted']
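
The ranking above relies on `np.argsort`, which returns the indices that would sort the coefficients from smallest to largest. In a binary `LogisticRegression`, positive coefficients push a prediction toward the second entry of `classes_` (here `'pos'`) and negative coefficients toward the first. A minimal sketch with made-up feature names and coefficients:

```python
import numpy as np

# Hypothetical features and coefficients, for illustration only
toy_names = np.array(["awful", "fun", "great", "plot"])
toy_coef = np.array([-2.0, 0.5, 1.5, -0.1])

order = np.argsort(toy_coef)  # indices from smallest to largest coefficient
print("Most positive:", toy_names[order[-2:]].tolist())  # ['fun', 'great']
print("Most negative:", toy_names[order[:2]].tolist())   # ['awful', 'plot']
```

Because the order is ascending, the strongest positive feature appears last in the positive list and the strongest negative feature appears first in the negative list.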




In [65]:
new_bad_review = "This was the most awful worst super bad movie ever!"

features = tfidf.transform([new_bad_review])

model.predict(features)

array(['neg'], dtype=object)

In [66]:
new_good_review = 'WHAT A WONDERFUL, FANTASTIC MOVIE!!!'

features = tfidf.transform([new_good_review])

model.predict(features)

array(['pos'], dtype=object)

In [67]:
# try a more complex statement
my_review = 'I hated this movie, even though my friend loved it'
my_features = tfidf.transform([my_review])
model.predict(my_features)

array(['neg'], dtype=object)
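
A single predicted label hides how confident the model is, which matters for mixed statements like the one above. `predict_proba` returns one probability per class, in the order given by `classes_`. The sketch below trains on a tiny made-up corpus (`toy_docs`, `toy_labels`), so its probabilities say nothing about the movie review model; it only shows the calls involved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training corpus, for illustration only
toy_docs = ["i loved this great film", "loved the wonderful cast", "a fun and great story",
            "i hated this boring film", "hated the awful script", "a bad and dull story"]
toy_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

toy_tfidf = TfidfVectorizer()
toy_clf = LogisticRegression(solver="liblinear")
toy_clf.fit(toy_tfidf.fit_transform(toy_docs), toy_labels)

mixed = "I hated this movie, even though my friend loved it"
probs = toy_clf.predict_proba(toy_tfidf.transform([mixed]))[0]

# One probability per class, in the order of toy_clf.classes_
for label, p in zip(toy_clf.classes_, probs):
    print(label, round(float(p), 3))
```

Probabilities close to 0.5 suggest the model sees evidence for both classes, which a bare `predict` call would not reveal.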

## Exercises - text classification

1. Practice your text pre-processing skills on the classic novel Dracula! Here you'll just be performing the standardization operations on a text string instead of a DataFrame, so be sure to adapt the practices you saw with the UN HRC corpus processing appropriately. 

    Can you:
    * Remove non-alphanumeric characters & punctuation?
    * Remove digits?
    * Remove unicode characters?
    * Remove extraneous spaces?
    * Standardize casing?
    * Lemmatize tokens?

2. Investigate classic horror novel vocabulary. Create a single TF-IDF sparse matrix that contains the vocabulary for _Frankenstein_ and _Dracula_. You should only have two rows (one for each of these novels), but potentially thousands of columns to represent the vocabulary across the two texts. What are the 20 most unique words in each? Make a dataframe or visualization to illustrate the differences.

3. [Read through this 20 newsgroups dataset example](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) to get familiar with newsgroup data. Do your best to understand and explain what is happening at each step of the workflow. "The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date."
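
As a possible starting point for Exercise 1, the sketch below applies several of the standardization steps to a plain string using only the standard library. The `clean_text` function and the sample string are hypothetical, and lemmatization is deliberately left out so you can add it yourself.

```python
import re

def clean_text(text):
    """Standardize a raw string; lemmatization is left as a separate step."""
    text = text.lower()                             # standardize casing
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"[^a-z\s]", " ", text)           # drop punctuation, digits, symbols
    text = re.sub(r"\s+", " ", text).strip()        # collapse extra whitespace
    return text

# Made-up sample string, for illustration only
sample = "3 May.   Bistritz. - Left Munich at 8:35 P.M., on 1st May!"
print(clean_text(sample))  # may bistritz left munich at p m on st may
```

Note the side effects: "P.M." becomes "p m" and "1st" becomes "st". Also, dropping non-ASCII characters removes accented letters from words, so decide whether each step suits your text before applying it to a whole novel.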

## Improving preprocessing accuracy and efficiency

Remember, these are just the basics. There are more efficient ways to preprocess your text that you will want to consider. Read Chapter 8 "spaCy and textaCy" to learn more!