Chapter 7 - English text preprocessing basics - and applications#

2022 August 26

Unstructured text - text you find in the wild in books and websites - is generally not amenable to analysis. Before it can be analyzed, the text needs to be standardized to a format so that Python can recognize each unit of meaning (called a “token”) as unique, no matter how many times it occurs and how it is stylized.

Although not an exhaustive list, some key steps in preprocessing text include:

  • Standardize text casing and spacing

  • Remove punctuation and special characters/symbols

  • Remove stop words

  • Stem or lemmatize: convert all non-base words to their base form

Stemming/lemmatization and stop words (and some punctuation) are language-specific. The Natural Language ToolKit (NLTK) works for English out-of-the-box, but you’ll need different code to work with other languages. Some languages (e.g. Chinese) also require segmentation: artificially inserting spaces between words. If you want to do text pre-processing for other languages, please let us know and we can help!
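To make the steps listed above concrete, here is a minimal sketch that uses only the Python standard library. The stop word set and the sample sentence are made up for illustration (NLTK's English stop word list is much longer), and stemming/lemmatization is left out.

```python
import re
from collections import Counter

# Toy stop word list for illustration only; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()                      # standardize casing
    text = re.sub(r"[^\w\s]", " ", text)     # punctuation and symbols -> space
    tokens = text.split()                    # split() also collapses extra spacing
    return [t for t in tokens if t not in STOP_WORDS]

raw = "The Council met. THE council, the COUNCIL!"

# Without standardization, the same word is counted under several spellings.
print(Counter(raw.split()))

# After preprocessing, every occurrence maps to the same token.
print(Counter(preprocess(raw)))
```

Compare the two counters: the first treats "Council", "council," and "COUNCIL!" as three different items, while the second counts a single token.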

# Ensure you have the proper nltk modules
import nltk
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from string import punctuation
import pandas as pd
import seaborn as sns
from collections import Counter
import regex as re

import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
import spacy
import nltk
from nltk.corpus import movie_reviews
import numpy as np
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix 

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)
Corpus definition: United Nations Human Rights Council Documentation#


We will select eleven .txt files from the UN HRC as our corpus, stored in the “human_rights” subfolder inside the main “data” directory.

These documents are Universal Periodic Review reports. They record the human rights recommendations that member nations made to the state under review.

Learn more about the UN HRC by clicking here.

Define the corpus directory#

Set the directory’s file path and print the files it contains.

# Make the directory "human_rights" inside of data
!mkdir data
!mkdir data/human_rights
# If your "data" folder already exists in Colab and you want to delete it, type:
# !rm -r data

# If the "human_rights" folder already exists in Colab and you want to delete it, type:
# !rm -r data/human_rights
# Download eleven UN HRC files
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# !wget -P data/human_rights/
# Check that we have eleven files, one for each country
!ls data/human_rights/
import os
corpus = os.listdir('data/human_rights/')

# View the contents of this directory
corpus

Store these documents in a data frame#

# Store in an empty dictionary for conversion to data frame
empty_dictionary = {}

# Loop through the folder of documents to open and read each one
for document in corpus:
    with open('data/human_rights/' + document, 'r', encoding = 'utf-8') as to_open:
        empty_dictionary[document] = to_open.read()

# Populate the data frame with two columns: file name and document text
human_rights = (pd.DataFrame.from_dict(empty_dictionary, 
                                       orient = 'index')
                .reset_index().rename(index = str, 
                                      columns = {'index': 'file_name', 0: 'document_text'}))
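The read-and-store loop above can be tried without the real corpus. The sketch below writes two small stand-in files to a temporary folder (the contents are hypothetical, not the actual UN HRC reports) and collects them into a dictionary keyed by file name. It uses `pathlib` instead of `os.listdir`, and sorts the paths because directory listings come back in arbitrary order.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with two small stand-in documents
# (hypothetical contents, not the real UN HRC files).
folder = Path(tempfile.mkdtemp())
(folder / "tuvalu2013.txt").write_text("Report on Tuvalu", encoding="utf-8")
(folder / "fiji2014.txt").write_text("Report on Fiji", encoding="utf-8")

# Map each file name to its full text. Sorting makes the order reproducible.
documents = {}
for path in sorted(folder.glob("*.txt")):
    documents[path.name] = path.read_text(encoding="utf-8")

# One (file_name, document_text) pair per document, ready for a data frame
rows = list(documents.items())
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=['file_name', 'document_text'])` would give the same two-column layout as the data frame built above.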

View the data frame#

file_name document_text
0 sanmarino2014.txt \n United Nations \n A/HRC/28/9 \n \n \n\n Ge...
1 tuvalu2013.txt \n United Nations \n A/HRC/24/8 \n \n \n\n G...
2 kazakhstan2014.txt \n United Nations \n A/HRC/28/10 \n \n \n\n G...
3 cotedivoire2014.txt \nDistr.: General 7 July 2014 English Original...
4 fiji2014.txt \n United Nations \n A/HRC/28/8 \n \n \n\n Ge...
5 bangladesh2013.txt \n United Nations \n A/HRC/24/12 \n \n \n\n ...
6 turkmenistan2013.txt \n United Nations \n A/HRC/24/3 \n \n \n\n G...
7 jordan2013.txt \nDistr.: General 6 January 2014 \nOriginal: E...
8 monaco2013.txt \nDistr.: General 3 January 2014 English Origi...
9 afghanistan2014.txt \nDistr.: General 4 April 2014 \nOriginal: Eng...
10 djibouti2013.txt \n\nDistr.: General 8 July 2013 English Origin...

View the text of the first document#

# first thousand characters
print(human_rights['document_text'][0][:1000])
 United Nations 

 General Assembly 
 Distr.: General 
24 December 2014 
Original: English 

Human Rights Council 

Twenty-eighth session 
Agenda item 6 
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review* 
 * The annex to the present report is circulated as received. 
  San Marino 
 Paragraphs Page 
  Introduction .............................................................................................................  1Ð4 3 
 I. Summary of the proceedings of the review process ................................................  5Ð77 3 
  A. Presentation by the State under review ...........................................................  5Ð23 3 
  B. Interactive dialogue and responses by the State under review ........................  24Ð77 6 
 II. Conclusions and/or recommendations .....................................................................  78Ð81 13 
  Composition of the delegation .......

English text preprocessing#

Create a new column named “clean_text” to store the text as it is preprocessed.

What are some of the things we can do?#

These are just a few examples. How else could you improve this process?

  • Remove non-alphanumeric characters/punctuation

  • Remove digits

  • Remove non-ASCII characters

  • Remove extra spaces

  • Convert to lowercase

  • Lemmatize (optional for now)

Take a look at the first document after each step to see if you can notice what changed.

Remember: the process will likely be different for many other natural languages, which frequently require special considerations.
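Before running the steps on the whole data frame, it can help to watch them act on a single string. This is a small sketch with a made-up sample that imitates the header of one report; it uses the standard library `re` module and plain string methods rather than the pandas `.str` accessors.

```python
import re

sample = "  A/HRC/28/9 \n Distr.: General   24 December 2014 \n pages 5Ð23 "

# Each step is a (label, function) pair, applied in order.
steps = [
    ("punctuation", lambda s: re.sub(r"[^\w\s]", " ", s)),
    ("digits", lambda s: re.sub(r"\d", " ", s)),
    ("non-ASCII", lambda s: s.encode("ascii", "ignore").decode("ascii")),
    ("extra spaces", lambda s: re.sub(r"\s+", " ", s)),
    ("lowercase", str.lower),
]

clean = sample
for name, step in steps:
    clean = step(clean)
    print(f"after {name}: {clean!r}")
```

Note that collapsing whitespace still leaves one leading and one trailing space; calling `.strip()` at the end would remove them.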

Remove non-alphanumeric characters/punctuation#

# Create a new column 'clean_text' to store the text we are standardizing
human_rights['clean_text'] = human_rights['document_text'].str.replace(r'[^\w\s]', ' ', regex = True)
 United Nations 
 A HRC 28 9 

 General Assembly 
 Distr   General 
24 December 2014 
Original  English 

Human Rights Council 

Twenty eighth session 
Agenda item 6 
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review  
   The annex to the present report is circulated as received  
  San Marino 
 Paragraphs Page 
  Introduction                                                                                                                1Ð4 3 
 I  Summary of the proceedings of the review process                                                   5Ð77 3 
  A  Presentation by the State under review                                                              5Ð23 3 
  B  Interactive dialogue and responses by the State under review                           24Ð77 6 
 II  Conclusions and or recommendations                                                                        78Ð81 13 
  Composition of the delegation        
# view third column
file_name document_text clean_text
0 sanmarino2014.txt \n United Nations \n A/HRC/28/9 \n \n \n\n Ge... \n United Nations \n A HRC 28 9 \n \n \n\n Ge...
1 tuvalu2013.txt \n United Nations \n A/HRC/24/8 \n \n \n\n G... \n United Nations \n A HRC 24 8 \n \n \n\n G...
2 kazakhstan2014.txt \n United Nations \n A/HRC/28/10 \n \n \n\n G... \n United Nations \n A HRC 28 10 \n \n \n\n G...
3 cotedivoire2014.txt \nDistr.: General 7 July 2014 English Original... \nDistr General 7 July 2014 English Original...
4 fiji2014.txt \n United Nations \n A/HRC/28/8 \n \n \n\n Ge... \n United Nations \n A HRC 28 8 \n \n \n\n Ge...
5 bangladesh2013.txt \n United Nations \n A/HRC/24/12 \n \n \n\n ... \n United Nations \n A HRC 24 12 \n \n \n\n ...
6 turkmenistan2013.txt \n United Nations \n A/HRC/24/3 \n \n \n\n G... \n United Nations \n A HRC 24 3 \n \n \n\n G...
7 jordan2013.txt \nDistr.: General 6 January 2014 \nOriginal: E... \nDistr General 6 January 2014 \nOriginal E...
8 monaco2013.txt \nDistr.: General 3 January 2014 English Origi... \nDistr General 3 January 2014 English Origi...
9 afghanistan2014.txt \nDistr.: General 4 April 2014 \nOriginal: Eng... \nDistr General 4 April 2014 \nOriginal Eng...
10 djibouti2013.txt \n\nDistr.: General 8 July 2013 English Origin... \n\nDistr General 8 July 2013 English Origin...

Remove digits#

human_rights['clean_text'] = human_rights['clean_text'].str.replace(r'\d', ' ', regex = True)
 United Nations 
 A HRC      

 General Assembly 
 Distr   General 
Original  English 

Human Rights Council 

Twenty eighth session 
Agenda item   
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review  
   The annex to the present report is circulated as received  
  San Marino 
 Paragraphs Page 
  Introduction                                                                                                                 Ð    
 I  Summary of the proceedings of the review process                                                    Ð     
  A  Presentation by the State under review                                                               Ð     
  B  Interactive dialogue and responses by the State under review                             Ð     
 II  Conclusions and or recommendations                                                                          Ð      
  Composition of the delegation        

Remove non-ASCII characters such as Ð and ð#

# for more on text encodings:
human_rights['clean_text'] = human_rights['clean_text'].str.encode('ascii', 'ignore').str.decode('ascii')
 United Nations 
 A HRC      

 General Assembly 
 Distr   General 
Original  English 

Human Rights Council 

Twenty eighth session 
Agenda item   
Universal Periodic Review 
  Report of the Working Group on the Universal Periodic Review  
   The annex to the present report is circulated as received  
  San Marino 
 Paragraphs Page 
 I  Summary of the proceedings of the review process                                                         
  A  Presentation by the State under review                                                                    
  B  Interactive dialogue and responses by the State under review                                  
 II  Conclusions and or recommendations                                                                                
  Composition of the delegation             
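One side effect of the encode/decode approach is worth seeing: it deletes every non-ASCII character, including accented letters inside real words. The sketch below compares it with an alternative that first decomposes accented letters using `unicodedata.normalize`. The helper names `drop_non_ascii` and `fold_to_ascii` are ours, and the second one is a suggestion, not the method used in this chapter.

```python
import unicodedata

def drop_non_ascii(text):
    """Same approach as above: discard every character outside ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

def fold_to_ascii(text):
    """Alternative: decompose accented letters first, then drop what is left."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

for word in ["5Ð23", "Côte", "président"]:
    print(word, "->", repr(drop_non_ascii(word)), "vs", repr(fold_to_ascii(word)))
```

Feature names such as "prsident" in the TF-IDF output later in this chapter are consistent with this kind of letter loss.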

Remove extra spaces#

import regex as re
human_rights['clean_text'] = human_rights['clean_text'].str.replace(r'\s+', ' ', regex = True)
 United Nations A HRC General Assembly Distr General December Original English Human Rights Council Twenty eighth session Agenda item Universal Periodic Review Report of the Working Group on the Universal Periodic Review The annex to the present report is circulated as received San Marino Contents Paragraphs Page Introduction I Summary of the proceedings of the review process A Presentation by the State under review B Interactive dialogue and responses by the State under review II Conclusions and or recommendations Annex Composition of the delegation Introduction The Working Group on the Universal Periodic Review established in accordance with Human Rights Council resolution of June held its twentieth session from October to November The review of San Marino was held at the th meeting on October The delegation of San Marino was headed by Pasquale Valentini Minister for Foreign Affairs At its th meeting held on October the Working Group adopted the report on San Marino On January the Hu

Convert to lowercase#

human_rights['clean_text'] = human_rights['clean_text'].str.lower()
 united nations a hrc general assembly distr general december original english human rights council twenty eighth session agenda item universal periodic review report of the working group on the universal periodic review the annex to the present report is circulated as received san marino contents paragraphs page introduction i summary of the proceedings of the review process a presentation by the state under review b interactive dialogue and responses by the state under review ii conclusions and or recommendations annex composition of the delegation introduction the working group on the universal periodic review established in accordance with human rights council resolution of june held its twentieth session from october to november the review of san marino was held at the th meeting on october the delegation of san marino was headed by pasquale valentini minister for foreign affairs at its th meeting held on october the working group adopted the report on san marino on january the hu


# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_lg
# Lemmatize: load the small English model, then replace each token with its lemma
nlp = spacy.load('en_core_web_sm')
human_rights['clean_text'] = human_rights['clean_text'].apply(lambda row: ' '.join([w.lemma_ for w in nlp(row)]))
View the updated data frame#

file_name document_text clean_text
0 sanmarino2014.txt \n United Nations \n A/HRC/28/9 \n \n \n\n Ge... united nations a hrc general assembly distr ...
1 tuvalu2013.txt \n United Nations \n A/HRC/24/8 \n \n \n\n G... united nations a hrc general assembly distr ...
2 kazakhstan2014.txt \n United Nations \n A/HRC/28/10 \n \n \n\n G... united nations a hrc general assembly distr ...
3 cotedivoire2014.txt \nDistr.: General 7 July 2014 English Original... distr general july english original english ...
4 fiji2014.txt \n United Nations \n A/HRC/28/8 \n \n \n\n Ge... united nations a hrc general assembly distr ...
5 bangladesh2013.txt \n United Nations \n A/HRC/24/12 \n \n \n\n ... united nations a hrc general assembly distr ...
6 turkmenistan2013.txt \n United Nations \n A/HRC/24/3 \n \n \n\n G... united nations a hrc general assembly distr ...
7 jordan2013.txt \nDistr.: General 6 January 2014 \nOriginal: E... distr general january original english gener...
8 monaco2013.txt \nDistr.: General 3 January 2014 English Origi... distr general january english original engli...
9 afghanistan2014.txt \nDistr.: General 4 April 2014 \nOriginal: Eng... distr general april original english general...
10 djibouti2013.txt \n\nDistr.: General 8 July 2013 English Origin... distr general july english original english ...

Exercises - redwoods webscraping#

This also works with data scraped from the web. Below is a very brief BeautifulSoup example that saves the contents of the Sequoioideae (redwood trees) Wikipedia page in a variable named text.

  1. Read through the code below

  2. Practice by repeating for a webpage of your choice


# import necessary libraries
from bs4 import BeautifulSoup
import requests
import regex as re
import nltk

Three variables will get you started#

  1. url - define the URL to be scraped

  2. response - perform the get request on the URL

  3. soup - create the soup object so we can parse the html

url = ""
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html')

Get the text#

HTML (hypertext markup language) is used to structure a webpage and the content it contains, including text.

Below is a handy for loop that finds everything within paragraph tags (<p>) and appends it to a single string.

# save in an empty string
text = ""

for paragraph in soup.find_all('p'):
    text += paragraph.text
Sequoioideae, popularly known as redwoods, is a subfamily of coniferous trees within the family Cupressaceae. It includes the largest and tallest trees in the world.
The three redwood subfamily genera are Sequoia from coastal California and Oregon, Sequoiadendron from California's Sierra Nevada, and Metasequoia in China. The redwood species contains the largest and tallest trees in the world. These trees can live for thousands of years. Threats include logging, fire suppression,[2] climate change, illegal marijuana cultivation, and burl poaching.[3][4][5]
Only two of the genera, Sequoia and Sequoiadendron, are known for massive trees. Trees of Metasequoia, from the single living species Metasequoia glyptostroboides, are much smaller.
Multiple studies of both morphological and molecular characters have strongly supported the assertion that the Sequoioideae are monophyletic.[6][7][8][9]
Most modern phylogenies place Sequoia as sister to Sequoiadendron and Metasequoia as the out-group.[7][9][10] However, Yang et al. went on to investigate the origin of a peculiar genetic artifact of the Sequoioideae—the polyploidy of Sequoia—and generated a notable exception that calls into question the specifics of this relative consensus.[9]
A 2006 paper based on non-molecular evidence suggested the following relationship among extant species:[11]

M. glyptostroboides (dawn redwood)
S. sempervirens (coast redwood)
S. giganteum (giant sequoia)
A 2021 study using molecular evidence found the same relationships among Sequoioideae species, but found Sequoioideae to be the sister group to the Athrotaxidoideae (a superfamily presently known only from Tasmania) rather than to Taxodioideae. Sequoioideae and Athrotaxidoideae are thought to have diverged from each other during the Jurassic.[12]
Reticulate evolution refers to the origination of a taxon through the merging of ancestor lineages.
Polyploidy has come to be understood as quite common in plants—with estimates ranging from 47% to 100% of flowering plants and extant ferns having derived from ancient polyploidy.[13] Within the gymnosperms however it is quite rare. Sequoia sempervirens is hexaploid (2n= 6x= 66). To investigate the origins of this polyploidy Yang et al. used two single copy nuclear genes, LFY and NLY, to generate phylogenetic trees. Other researchers have had success with these genes in similar studies on different taxa.[9]
Several hypotheses have been proposed to explain the origin of Sequoia's polyploidy: allopolyploidy by hybridization between Metasequoia and some probably extinct taxodiaceous plant; Metasequoia and Sequoiadendron, or ancestors of the two genera, as the parental species of Sequoia; and autohexaploidy, autoallohexaploidy, or segmental allohexaploidy.
Yang et al. found that Sequoia was clustered with Metasequoia in the tree generated using the LFY gene but with Sequoiadendron in the tree generated with the NLY gene. Further analysis strongly supported the hypothesis that Sequoia was the result of a hybridization event involving Metasequoia and Sequoiadendron. Thus, Yang et al. hypothesize that the inconsistent relationships among Metasequoia, Sequoia, and Sequoiadendron could be a sign of reticulate evolution by hybrid speciation (in which two species hybridize and give rise to a third) among the three genera. However, the long evolutionary history of the three genera (the earliest fossil remains being from the Jurassic) make resolving the specifics of when and how Sequoia originated once and for all a difficult matter—especially since it in part depends on an incomplete fossil record.[10]
Sequoioideae is an ancient taxon, with the oldest described Sequoioideae species, Sequoia jeholensis, recovered from Jurassic deposits.[14]  A genus Medulloprotaxodioxylon, reported from the late Triassic of China supports the idea of a Norian origin.[1]
The fossil record shows a massive expansion of range in the Cretaceous and dominance of the Arcto-Tertiary Geoflora, especially in northern latitudes. Genera of Sequoioideae were found in the Arctic Circle, Europe, North America, and throughout Asia and Japan.[15] A general cooling trend beginning in the late Eocene and Oligocene reduced the northern ranges of the Sequoioideae, as did subsequent ice ages.[16] Evolutionary adaptations to ancient environments persist in all three species despite changing climate, distribution, and associated flora, especially the specific demands of their reproduction ecology that ultimately forced each of the species into refugial ranges where they could survive.
The entire subfamily is endangered.  The IUCN Red List Category & Criteria assesses Sequoia sempervirens as Endangered (A2acd), Sequoiadendron giganteum as Endangered (B2ab) and Metasequoia glyptostroboides as Endangered (B1ab).
The two California redwood species, since the early 19th century, and the Chinese redwood species since 1948, have been cultivated horticulturally far beyond their native habitats. They are found in botanical gardens, public parks, and private landscapes in many similar climates worldwide. Plantings outside their native ranges particularly are found in California, the coastal Northwestern and the Eastern United States, areas of China, Ireland,[17] Germany, the United Kingdom, Australia and near Rotorua, New Zealand.[18] They are also used in educational projects recreating the look of the megaflora of the Pleistocene landscape.
New World Species:
New World Species:
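If you want to see what the paragraph loop does without making a web request, the sketch below reproduces the idea on a small hard-coded page. It is a stand-in, not the BeautifulSoup code itself: it uses the standard library `html.parser` module, and both the `ParagraphText` class and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> ... </p> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

# Hypothetical page content, used instead of a live request
html = """
<html><body>
  <h1>Redwoods</h1>
  <p>Redwoods are <b>tall</b> trees.[1]</p>
  <div>Navigation menu</div>
  <p>They can live for thousands of years.</p>
</body></html>
"""

parser = ParagraphText()
parser.feed(html)
text = "".join(parser.chunks)
print(text)
```

As in the scraped output above, the heading and navigation text are skipped, citation markers such as [1] are kept, and paragraphs are joined with no separator between them.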

Regular expressions#

Regular expressions are sequences of characters and symbols that represent search patterns in text - and are generally quite useful.

Check out the tutorial and cheatsheet to find out what the symbols below mean, then write your own code. Better yet, in some cases you can write a single pattern that does several of these steps at once, in fewer lines of code!

text = re.sub(r'\[[0-9]*\]',' ',text)
text = re.sub(r'\s+',' ',text)
text = re.sub(r'\d',' ',text)
text = re.sub(r'[^\w\s]','',text)
text = text.lower()
text = re.sub(r'\s+',' ',text)
# print(text)
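As one possible answer to the challenge above, the sketch below puts the six steps in a function and compares them with a version that merges the citation, digit, and punctuation patterns into one alternation. The sample sentence is adapted from the scraped text, the function names are ours, and the standard library `re` module is used here in place of `regex`.

```python
import re

sample = "Threats include logging, fire suppression,[2] and burl poaching.[3][4] A 2006 paper agreed."

def clean_stepwise(text):
    """The six steps from the cell above, in the same order."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)   # citation markers like [2]
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\d', ' ', text)           # digits
    text = re.sub(r'[^\w\s]', '', text)       # punctuation
    text = text.lower()
    return re.sub(r'\s+', ' ', text)

def clean_combined(text):
    """One alternation handles citations, digits, and punctuation together."""
    text = re.sub(r'\[\d*\]|\d|[^\w\s]', ' ', text)
    return re.sub(r'\s+', ' ', text.lower()).strip()

print(clean_stepwise(sample))
print(clean_combined(sample))
```

On this sample the two functions agree. They differ when punctuation sits inside a word: the stepwise version deletes it ("Sequoia's" becomes "sequoias"), while the combined version replaces it with a space ("sequoia s").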

Unsupervised learning with TfidfVectorizer()#

Remember CountVectorizer() for creating bag-of-words models? With TfidfVectorizer() we can extend the idea of counting words to weighting each word by how distinctive it is for a document relative to the rest of the corpus. Each row is still a document in the document-term matrix and each column is still a linguistic feature, but the cells now hold TF-IDF weights instead of raw frequencies. Remember that:

  • For TF-IDF sparse matrices:

    • A value closer to 1 indicates that a feature is more relevant to a particular document.

    • A value closer to 0 indicates that that feature is less/not relevant to that document.
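To see where these weights come from, here is a hand-rolled sketch on three tiny made-up documents. It follows the formula that scikit-learn documents as its defaults (smoothed idf plus L2 normalization); the documents, variable names, and the `tfidf` helper are ours, and the sketch ignores tokenization details, n-grams, and stop words.

```python
import math
from collections import Counter

docs = [
    "jordan press law",
    "fiji press climate",
    "monaco press tourism",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document, scaled to unit length."""
    counts = Counter(tokens)
    # Smoothed idf, ln((1 + n) / (1 + df)) + 1, then L2 normalization.
    # These mirror scikit-learn's documented defaults (smooth_idf=True, norm='l2').
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in counts.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / length for t, w in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{weight:.3f}")
```

The word "press" appears in every document, so it gets a lower weight than "jordan", which appears in only one. With a setting like max_df = 0.50, a term found in more than half of the documents would be dropped from the vocabulary altogether.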





from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(ngram_range = (1, 3), 
                                stop_words = 'english', 
                                max_df = 0.50)
tf_sparse = tf_vectorizer.fit_transform(human_rights['clean_text'])
(11, 84181)
  (0, 39182)	0.004856846879037771
  (0, 31574)	0.004856846879037771
  (0, 50165)	0.004856846879037771
  (0, 79743)	0.004856846879037771
  (0, 46164)	0.004856846879037771
  (0, 70574)	0.004856846879037771
  (0, 67048)	0.004856846879037771
  (0, 48393)	0.004856846879037771
  (0, 55413)	0.004856846879037771
  (0, 5657)	0.004856846879037771
  (0, 2036)	0.004856846879037771
  (0, 18508)	0.004856846879037771
  (0, 4238)	0.004856846879037771
  (0, 49342)	0.004856846879037771
  (0, 2719)	0.004856846879037771
  (0, 39331)	0.004856846879037771
  (0, 7341)	0.004856846879037771
  (0, 80381)	0.004856846879037771
  (0, 49382)	0.004856846879037771
  (0, 2723)	0.004856846879037771
  (0, 43326)	0.004856846879037771
  (0, 20394)	0.004856846879037771
  (0, 27591)	0.004856846879037771
  (0, 53796)	0.004856846879037771
  (0, 74877)	0.004856846879037771
  :	:
  (10, 23861)	0.004813404453682774
  (10, 53868)	0.005386104036626412
  (10, 61809)	0.005386104036626412
  (10, 11732)	0.006124442213746767
  (10, 52233)	0.006124442213746767
  (10, 32036)	0.00869094964616284
  (10, 38685)	0.00869094964616284
  (10, 48447)	0.012248884427493533
  (10, 64367)	0.00869094964616284
  (10, 28795)	0.004813404453682774
  (10, 66492)	0.03231662421975847
  (10, 62562)	0.005386104036626412
  (10, 30131)	0.004813404453682774
  (10, 79340)	0.006124442213746767
  (10, 4251)	0.00434547482308142
  (10, 27107)	0.00434547482308142
  (10, 57250)	0.026072848938488522
  (10, 60838)	0.030418323761569943
  (10, 70491)	0.026072848938488522
  (10, 76162)	0.00434547482308142
  (10, 72779)	0.005386104036626412
  (10, 27613)	0.00434547482308142
  (10, 28707)	0.010772208073252824
  (10, 8151)	0.010772208073252824
  (10, 18289)	0.00869094964616284

Convert the tfidf sparse matrix to data frame#

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
tfidf_df = pd.DataFrame(tf_sparse.todense(), columns = tf_vectorizer.get_feature_names_out())
abasi abasi desk abasi desk officer abdi abdi ismael abdi ismael hersi abdou abdou prsident abdou prsident la abduction ... zone zone inclusive zone inclusive education zone senegal zone senegal make zone social zone social benefit zouon zouon bi zouon bi tidou
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.004509 ... 0.011473 0.006711 0.006711 0.000000 0.000000 0.006711 0.006711 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006249 0.006249 0.006249
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.011198 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.006296 0.000000 0.000000 0.007365 0.007365 0.000000 0.000000 0.000000 0.000000 0.000000
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.004721 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
9 0.007608 0.007608 0.007608 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.005111 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
10 0.000000 0.000000 0.000000 0.007165 0.007165 0.007165 0.007165 0.007165 0.007165 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

11 rows × 84181 columns

View 20 highest weighted words#

tfidf_df.max().sort_values(ascending = False).head(n = 20)
monaco                        0.720348
tuvalu                        0.639251
kazakhstan                    0.615483
fiji                          0.582410
turkmenistan                  0.578904
san                           0.553681
jordan                        0.491254
san marino                    0.456544
marino                        0.456544
divoire                       0.340046
te divoire                    0.340046
te                            0.306989
elimination violence          0.253620
elimination violence woman    0.253620
djiboutis                     0.250777
reconciliation                0.245711
fgm                           0.195982
afghan                        0.190201
bangladeshs                   0.183356
violence woman law            0.182593
dtype: float64

Add country name to tfidf_df#

This way, we will know which document corresponds to which country.

# wrangle the country names from the human_rights data frame
countries = human_rights['file_name'].str.slice(stop = -8)
countries = list(countries)
tfidf_df['COUNTRY'] = countries
abasi abasi desk abasi desk officer abdi abdi ismael abdi ismael hersi abdou abdou prsident abdou prsident la abduction ... zone inclusive zone inclusive education zone senegal zone senegal make zone social zone social benefit zouon zouon bi zouon bi tidou COUNTRY
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 sanmarino
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 tuvalu
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.004509 ... 0.006711 0.006711 0.000000 0.000000 0.006711 0.006711 0.000000 0.000000 0.000000 kazakhstan
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006249 0.006249 0.006249 cotedivoire
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 fiji
5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.011198 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 bangladesh
6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.007365 0.007365 0.000000 0.000000 0.000000 0.000000 0.000000 turkmenistan
7 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.004721 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 jordan
8 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 monaco
9 0.007608 0.007608 0.007608 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.005111 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 afghanistan
10 0.000000 0.000000 0.000000 0.007165 0.007165 0.007165 0.007165 0.007165 0.007165 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 djibouti

11 rows × 84182 columns
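The country names above come from cutting the last eight characters (a four-digit year plus ".txt") off each file name. The sketch below shows the same slice on plain strings, next to a pattern-based alternative that does not depend on counting characters. The regular expression is our suggestion, not part of the original code.

```python
import re

file_names = ["sanmarino2014.txt", "cotedivoire2014.txt", "fiji2014.txt"]

# Fixed-width slice: drop the last 8 characters, e.g. "2014.txt"
by_slice = [name[:-8] for name in file_names]

# Pattern-based alternative: strip a trailing 4-digit year plus ".txt"
by_regex = [re.sub(r"\d{4}\.txt$", "", name) for name in file_names]

print(by_slice)
print(by_regex)
```

Both give the same names here. The pattern version would leave a file name untouched if it did not end in a year and ".txt", whereas the slice would silently cut off the wrong characters.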

Examine unique words by each document/country#

Change the country name to view that country's highest-weighted terms.

country = tfidf_df[tfidf_df['COUNTRY'] == 'jordan']
country.max(numeric_only = True).sort_values(ascending = False).head(20)
jordan                                0.491254
jordanian                             0.140540
press publication                     0.112432
syrian                                0.105405
reservation                           0.099133
publication law                       0.091351
press publication law                 0.091351
constitutional amendment              0.089799
syrian refugee                        0.084324
publication                           0.080973
reservation convention                0.078083
reservation convention elimination    0.072077
website                               0.072077
commitment jordan                     0.063243
news website                          0.056216
news                                  0.054058
al                                    0.054058
personal status                       0.054058
personal                              0.052823
host                                  0.046879
dtype: float64

UN HRC text analysis - what next?#

What next? Keep in mind that we have not even begun to consider named entities and parts of speech. What problems immediately jump out from the above examples, such as with the number and uniqueness of country names?

Chapters 8 and 9 introduce more powerful text preprocessing and analysis techniques. Read ahead to see how they can handle roadblocks such as these.

Sentiment analysis#

Sentiment analysis mines text for subjective information in order to classify it as positive, negative, or neutral. In this section, we train a classifier on movie reviews that humans have already judged as positive or negative.
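
As a toy illustration of the three labels, the sketch below counts words from two tiny hand-made lists. This is not the approach used in the rest of this section, which trains a classifier on labeled reviews; the word lists and the `toy_sentiment` function are hypothetical.

```python
# Tiny hand-made word lists; hypothetical, for illustration only.
POSITIVE = {"great", "wonderful", "fun", "excellent"}
NEGATIVE = {"bad", "awful", "boring", "worst"}

def toy_sentiment(text):
    """Label text by counting positive versus negative words (toy approach)."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(toy_sentiment("a great and fun movie"))
print(toy_sentiment("boring plot and awful script"))
print(toy_sentiment("the movie was two hours long"))
```

A word-counting approach like this ignores context and negation, which is one reason to learn the weights from labeled examples instead.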



Download the built-in NLTK movie reviews dataset#

import nltk
from nltk.corpus import movie_reviews
# Download the corpus (skipped automatically if it is already up-to-date)
nltk.download("movie_reviews")
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/evanmuzzall/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

Define x (reviews) and y (judgements) variables#

# Extract our x (reviews) and y (judgements) variables
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]
judgements = [movie_reviews.categories(fileid)[0] for fileid in movie_reviews.fileids()]
# Save in a dataframe
movies = pd.DataFrame({"Reviews" : reviews, 
                      "Judgements" : judgements})
Reviews Judgements
0 plot : two teen couples go to a church party ,... neg
1 the happy bastard's quick movie review \ndamn ... neg
2 it is movies like these that make a jaded movi... neg
3 " quest for camelot " is warner bros . ' firs... neg
4 synopsis : a mentally unstable man undergoing ... neg
(2000, 2)

Shuffle the reviews#

import numpy as np
from sklearn.utils import shuffle
x, y = shuffle(np.array(movies.Reviews), np.array(movies.Judgements), random_state = 1)
# change x[0] and y[0] to see different reviews
x[0], print("Human review was:", y[0])
Human review was: neg
('steve martin is one of the funniest men alive . \nif you can take that as a true statement , then your disappointment at this film will equal mine . \nmartin can be hilarious , creating some of the best laugh-out-loud experiences that have ever taken place in movie theaters . \nyou won\'t find any of them here . \nthe old television series that this is based on has its moments of humor and wit . \nbilko ( and the name isn\'t an accident ) is the head of an army motor pool group , but his passion is his schemes . \nevery episode involves the sergeant and his men in one or another hair-brained plan to get rich quick while outwitting the officers of the base . \n " mchale\'s navy " \'s granddaddy . \nthat\'s the idea behind this movie too , but the difference is that , as far-fetched and usually goofy as the television series was , it was funny . \nthere is not one laugh in the film . \nthe re-make retains the goofiness , but not the entertainment . \neverything is just too clean . \nit was obviously made on a hollywood back lot and looks every bit like it . \nit all looks brand new , even the old beat-up stuff . \nmartin is remarkably small in what should have been a bigger than life role . \nin the original , phil silvers played the huckster with a heart of gold and more than a touch of sleaziness . \nmartin\'s bilko is a pale imitation . \nthe only semi-bright spot is phil hartman as bilko\'s arch-enemy . \nit\'s not saying much , considering martin\'s lackluster character , but hartman leaves him in the dust . \n',

Pipelines - one example#

scikit-learn offers handy ways to build machine learning pipelines:

# standard training/test split (no cross validation)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 0)

# get tfidf values
# fit the vocabulary and IDF weights on the training set only,
# then reuse them to transform the test set
tfidf = TfidfVectorizer()
x_train = tfidf.fit_transform(x_train)
x_test = tfidf.transform(x_test)

# instantiate, train, and test a logistic regression model
logit_class = LogisticRegression(solver = 'liblinear',
                                 penalty = 'l2', 
                                 C = 1000, 
                                 random_state = 1)
model = logit_class.fit(x_train, y_train)
# test set accuracy
model.score(x_test, y_test)
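
The cell above runs the vectorizer and the classifier as separate steps. A minimal sketch of the same idea wrapped in a single `Pipeline` object might look like the following. It uses a tiny made-up corpus (`toy_x`, `toy_y`) rather than the movie reviews so that it runs on its own, and the names `toy_pipe`, `x_tr`, and `x_te` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up miniature corpus: four positive and four negative "reviews"
toy_x = ["great fun film", "wonderful and fun", "excellent great cast", "fun excellent story",
         "bad boring film", "awful and dull", "boring bad script", "dull awful story"]
toy_y = ["pos"] * 4 + ["neg"] * 4

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=0, stratify=toy_y)

# The pipeline fits the vectorizer on the training text only, then the classifier
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", C=1000, random_state=1)),
])
toy_pipe.fit(x_tr, y_tr)

# Raw strings go straight in; the pipeline applies transform() before predicting
print(toy_pipe.score(x_te, y_te))
print(toy_pipe.predict(["boring and awful"]))
```

One benefit of this design is that the vectorizer can never accidentally be fit on test data, because fitting and transforming are handled inside the pipeline.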

\(k\)-fold cross-validated model#

# Cross-validated model!
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(solver = 'liblinear',
                                               penalty = 'l2', 
                                               C = 1000, 
                                               random_state = 1))
                    ])

# for your own research, thesis, or publication
# you would select cv equal to 10 or 20
scores = cross_val_score(text_clf, x, y, cv = 3)

print(scores, np.mean(scores))
[0.8155922  0.79910045 0.80630631] 0.8069996533264899
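
To see what \(k\)-fold splitting does, the sketch below prints the train and test indices for six placeholder documents and three folds. It uses plain `KFold` only to show the idea. For a classifier, `cross_val_score` with an integer `cv` uses stratified folds, so the actual splits in the cell above will differ. The `toy_docs` array is made up.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six placeholder documents
toy_docs = np.array(["doc0", "doc1", "doc2", "doc3", "doc4", "doc5"])

kf = KFold(n_splits=3)  # no shuffling: folds are consecutive blocks
folds = []
for train_idx, test_idx in kf.split(toy_docs):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Each document lands in the test set exactly once, and the model is refit on the remaining documents each time. The scores printed above are the per-fold accuracies and their mean.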

Top 25 features for positive and negative reviews#

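# get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is its replacement
feature_names = tfidf.get_feature_names_out()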
top25pos = np.argsort(model.coef_[0])[-25:]
print("Top features for positive reviews:")
print(list(feature_names[j] for j in top25pos))
print("Top features for negative reviews:")
top25neg = np.argsort(model.coef_[0])[:25]
print(list(feature_names[j] for j in top25neg))
Top features for positive reviews:
['gas', 'perfectly', 'family', 'political', 'will', 'seen', 'rocky', 'always', 'different', 'excellent', 'also', 'many', 'is', 'matrix', 'trek', 'well', 'definitely', 'truman', 'very', 'great', 'quite', 'fun', 'jackie', 'as', 'and']

Top features for negative reviews:
['bad', 'only', 'plot', 'worst', 'there', 'boring', 'script', 'why', 'have', 'unfortunately', 'dull', 'poor', 'any', 'waste', 'nothing', 'looks', 'ridiculous', 'supposed', 'no', 'even', 'harry', 'awful', 'then', 'reason', 'wasted']
new_bad_review = "This was the most awful worst super bad movie ever!"

features = tfidf.transform([new_bad_review])

model.predict(features)
array(['neg'], dtype=object)
new_good_review = 'WHAT A WONDERFUL, FANTASTIC MOVIE!!!'

features = tfidf.transform([new_good_review])

model.predict(features)
array(['pos'], dtype=object)
# try a more complex statement
my_review = 'I hated this movie, even though my friend loved it'
my_features = tfidf.transform([my_review])
model.predict(my_features)
array(['neg'], dtype=object)
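
A single label hides how confident the model is, which matters for mixed statements like the last one. The sketch below shows how `predict_proba` could be used to inspect class probabilities. It trains on a tiny made-up corpus so that it runs on its own; `toy_tfidf` and `toy_model` are assumed names, and the probabilities it prints say nothing about the movie review model above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up miniature training set
toy_x = ["loved it great movie", "my friend loved this", "great fun movie",
         "hated it awful movie", "i hated this", "awful boring movie"]
toy_y = ["pos", "pos", "pos", "neg", "neg", "neg"]

toy_tfidf = TfidfVectorizer()
toy_model = LogisticRegression(solver="liblinear", random_state=1)
toy_model.fit(toy_tfidf.fit_transform(toy_x), toy_y)

my_review = "I hated this movie, even though my friend loved it"
probs = toy_model.predict_proba(toy_tfidf.transform([my_review]))

# Columns of probs follow the order of toy_model.classes_
for label, p in zip(toy_model.classes_, probs[0]):
    print(label, round(p, 3))
```

With the chapter's objects, the equivalent call would be `model.predict_proba(my_features)`.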

Exercises - text classification#

  1. Practice your text pre-processing skills on the classic novel Dracula! Here you’ll just be performing the standardization operations on a text string instead of a DataFrame, so be sure to adapt the practices you saw with the UN HRC corpus processing appropriately.

    Can you:

    • Remove non-alphanumeric characters & punctuation?

    • Remove digits?

    • Remove unicode characters?

    • Remove extraneous spaces?

    • Standardize casing?

    • Lemmatize tokens?

  2. Investigate classic horror novel vocabulary. Create a single TF-IDF sparse matrix that contains the vocabulary for Frankenstein and Dracula. You should only have two rows (one for each of these novels), but potentially thousands of columns to represent the vocabulary across the two texts. What are the 20 most unique words in each? Make a dataframe or visualization to illustrate the differences.

  3. Read through this 20 newsgroups dataset example to get familiar with newsgroup data. Do your best to understand and explain what is happening at each step of the workflow. “The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.”

Improving preprocessing accuracy and efficiency#

Remember these are just the basics. There are more efficient ways to preprocess your text that you will want to consider. Read Chapter 8 “spaCy and textaCy” to learn more!