Chapter 8 - spaCy and textacy#
2023 April 28
These abridged materials are borrowed from the CIDR Workshop Text Analysis with Python
Why spaCy and textacy?#
The language processing features of spaCy and the corpus analysis methods of textacy together offer a wide range of functionality for text analysis in a well-maintained and well-documented software package that incorporates cutting-edge techniques as well as standard approaches.
The “C” in spaCy (and textacy) stands for Cython, which is Python that is compiled to C code and thus offers some performance advantages over interpreted Python, especially when working with large machine-learning models. The use of machine-learning models, including neural networks, is a key feature of spaCy and textacy. The authors of these libraries have also developed Prodigy, a similarly leading-edge but approachable tool for training custom machine-learning models for text analysis, among other uses.
Check out the spaCy 101 guide to learn more.
Topics#
Document Tokenization
Part-of-Speech (POS) Tagging
Named-Entity Recognition (NER)
Corpus Vectorization
Topic Modeling
Document Similarity
Stylistic Analysis
Note: The examples from this workshop use English texts, but all of the methods are applicable to other languages. The availability of specialized resources (parsing rules, dictionaries, trained models) can vary considerably by language, however.
A brief word about terms#
Text analysis involves the extraction of information from significant amounts of free-form text, e.g., literature (prose, poetry), historical records, long-form survey responses, legal documents. Some of the techniques used are also applicable to short-form text data, including documents that are already in tabular format.
Text analysis methods are built upon techniques for Natural Language Processing (NLP), which began as rule-based approaches to parsing human language and eventually incorporated statistical machine learning methods as well as, most recently, neural network/deep learning-based approaches.
Text mining typically refers to the extraction of information from very large corpora of unstructured texts.
# !pip install textacy
import spacy
import textacy
Document-level analysis with spaCy#
Let’s start by learning how spaCy works and using it to begin analyzing a single text document. We’ll work with larger corpora later in the workshop.
For this workshop we will work with a pre-trained statistical and deep-learning model provided by spaCy to process text. spaCy’s models are differentiated by language (21 languages are supported at present), capabilities, training text, and size. Smaller models are more efficient; larger models are more accurate. Here we’ll download and use a medium-sized English multi-task model, which supports part-of-speech tagging and named-entity recognition and includes a word-vector model.
!python -m spacy download en_core_web_md
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
# Once we've installed the model, we can import it like any other Python library
import en_core_web_md
# This instantiates a spaCy text processor based on the installed model
nlp = en_core_web_md.load()
# From H.G. Wells's A Short History of the World, Project Gutenberg
text = """Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity. {111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized. He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days. His collection has been unearthed and is
perhaps the most precious store of historical material in the
world. The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes. He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions. But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there. This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians. They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire. Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor. {112} He came up against Babylon, there was a battle
outside the walls, and the gates of the city were opened to him
(538 B.C.). His soldiers entered the city without fighting. The
crown prince Belshazzar, the son of Nabonidus, was feasting, the
Bible relates, when a hand appeared and wrote in letters of fire
upon the wall these mystical words: _"Mene, Mene, Tekel,
Upharsin,"_ which was interpreted by the prophet Daniel, whom he
summoned to read the riddle, as "God has numbered thy kingdom and
finished it; thou art weighed in the balance and found wanting and
thy kingdom is given to the Medes and Persians." Possibly the
priests of Bel Marduk knew something about that writing on the
wall. Belshazzar was killed that night, says the Bible.
Nabonidus was taken prisoner, and the occupation of the city was
so peaceful that the services of Bel Marduk continued without
intermission."""
By default, spaCy applies its entire NLP “pipeline” to the text as soon as it is provided to the model and outputs a processed “doc.”
doc = nlp(text)
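Before we dig into the results, it can be helpful to see which pipeline components actually ran; spaCy exposes them by name (for en_core_web_md this typically includes a tagger, parser, and named-entity recognizer, among others).
# Inspect the components that make up the loaded pipeline
print(nlp.pipe_names)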
Tokenization#
The doc created by spaCy immediately provides access to the word-level tokens of the text.
for token in doc[:15]:
    print(token)
Even
under
the
Assyrian
monarchs
and
especially
under
Sardanapalus
,
Babylon
had
been
a
Each of these tokens has a number of properties, and we’ll look a bit more closely at them in a minute.
spaCy also automatically provides sentence-level segmentation (sentencization).
import itertools
for sent in itertools.islice(doc.sents, 10):
    print(sent.text + "\n--\n")
Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity.
--
{111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized.
--
He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days.
--
His collection has been unearthed and is
perhaps the most precious store of historical material in the
world.
--
The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes.
--
He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions.
--
But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there.
--
This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians.
--
They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire.
--
Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor.
--
You’ll notice that the line breaks in the sample text are making the extracted sentences, and also the word-level tokens, a bit messy. The simplest way to avoid this is just to replace all single line breaks in the text with spaces before running it through the spaCy pipeline, i.e., as a preprocessing step.
There are other ways to handle this within the spaCy pipeline; an important feature of spaCy is that every phase of the built-in pipeline can be replaced by a custom module. One could imagine, for example, writing a replacement sentencizer that takes advantage of the presence of two spaces between all sentences in the sample text; a rough sketch of that idea follows, though we leave refining it as an exercise for the reader.
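This is only a sketch, under the assumption that the double-space convention holds throughout the text; the component name is our own invention, and we don’t actually add it to the pipeline used in this workshop.
from spacy.language import Language

@Language.component("double_space_sentencizer")
def double_space_sentencizer(doc):
    # Runs of extra whitespace in the raw text become space *tokens*;
    # in our sample these occur exactly at sentence boundaries.
    for token in doc[:-1]:
        if token.is_space:
            doc[token.i + 1].is_sent_start = True
    return doc

# To activate it, register it before the parser (sentence starts can't be
# changed once a doc has been parsed):
# nlp.add_pipe("double_space_sentencizer", before="parser")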
text_as_line = text.replace("\n", " ")
doc = nlp(text_as_line)
for sent in itertools.islice(doc.sents, 10):
    print(sent.text + "\n--\n")
Even under the Assyrian monarchs and especially under Sardanapalus, Babylon had been a scene of great intellectual activity.
--
{111} Sardanapalus, though an Assyrian, had been quite Babylon-ized.
--
He made a library, a library not of paper but of the clay tablets that were used for writing in Mesopotamia since early Sumerian days.
--
His collection has been unearthed and is perhaps the most precious store of historical material in the world.
--
The last of the Chaldean line of Babylonian monarchs, Nabonidus, had even keener literary tastes.
--
He patronized antiquarian researches, and when a date was worked out by his investigators for the accession of Sargon I he commemorated the fact by inscriptions.
--
But there were many signs of disunion in his empire, and he sought to centralize it by bringing a number of the various local gods to Babylon and setting up temples to them there.
--
This device was to be practised quite successfully by the Romans in later times, but in Babylon it roused the jealousy of the powerful priesthood of Bel Marduk, the dominant god of the Babylonians.
--
They cast about for a possible alternative to Nabonidus and found it in Cyrus the Persian, the ruler of the adjacent Median Empire.
--
Cyrus had already distinguished himself by conquering Croesus, the rich king of Lydia in Eastern Asia Minor.
--
We can collect both words and sentences into standard Python data structures (lists, in this case).
doc.sents
<generator at 0x7f8ecbf66ea0>
sentences = [sent.text for sent in doc.sents]
sentences
['Even under the Assyrian monarchs and especially under Sardanapalus, Babylon had been a scene of great intellectual activity.',
' {111} Sardanapalus, though an Assyrian, had been quite Babylon-ized.',
' He made a library, a library not of paper but of the clay tablets that were used for writing in Mesopotamia since early Sumerian days.',
' His collection has been unearthed and is perhaps the most precious store of historical material in the world.',
' The last of the Chaldean line of Babylonian monarchs, Nabonidus, had even keener literary tastes.',
' He patronized antiquarian researches, and when a date was worked out by his investigators for the accession of Sargon I he commemorated the fact by inscriptions.',
' But there were many signs of disunion in his empire, and he sought to centralize it by bringing a number of the various local gods to Babylon and setting up temples to them there.',
' This device was to be practised quite successfully by the Romans in later times, but in Babylon it roused the jealousy of the powerful priesthood of Bel Marduk, the dominant god of the Babylonians.',
' They cast about for a possible alternative to Nabonidus and found it in Cyrus the Persian, the ruler of the adjacent Median Empire.',
' Cyrus had already distinguished himself by conquering Croesus, the rich king of Lydia in Eastern Asia Minor.',
' {112} He came up against Babylon, there was a battle outside the walls, and the gates of the city were opened to him (538 B.C.).',
' His soldiers entered the city without fighting.',
' The crown prince Belshazzar, the son of Nabonidus, was feasting, the Bible relates, when a hand appeared and wrote in letters of fire upon the wall these mystical words: _"Mene, Mene, Tekel, Upharsin,"_ which was interpreted by the prophet Daniel, whom he summoned to read the riddle, as "God has numbered thy kingdom and finished it; thou art weighed in the balance and found wanting and thy kingdom is given to the Medes and Persians."',
' Possibly the priests of Bel Marduk knew something about that writing on the wall.',
' Belshazzar was killed that night, says the Bible.',
'Nabonidus was taken prisoner, and the occupation of the city was so peaceful that the services of Bel Marduk continued without intermission.']
words = [token.text for token in doc]
words
['Even',
'under',
'the',
'Assyrian',
'monarchs',
'and',
'especially',
'under',
'Sardanapalus',
',',
'Babylon',
'had',
'been',
'a',
'scene',
'of',
'great',
'intellectual',
'activity',
'.',
' ',
'{',
'111',
'}',
'Sardanapalus',
',',
'though',
'an',
'Assyrian',
',',
'had',
'been',
'quite',
'Babylon',
'-',
'ized',
'.',
' ',
'He',
'made',
'a',
'library',
',',
'a',
'library',
'not',
'of',
'paper',
'but',
'of',
'the',
'clay',
'tablets',
'that',
'were',
'used',
'for',
'writing',
'in',
'Mesopotamia',
'since',
'early',
'Sumerian',
'days',
'.',
' ',
'His',
'collection',
'has',
'been',
'unearthed',
'and',
'is',
'perhaps',
'the',
'most',
'precious',
'store',
'of',
'historical',
'material',
'in',
'the',
'world',
'.',
' ',
'The',
'last',
'of',
'the',
'Chaldean',
'line',
'of',
'Babylonian',
'monarchs',
',',
'Nabonidus',
',',
'had',
'even',
'keener',
'literary',
'tastes',
'.',
' ',
'He',
'patronized',
'antiquarian',
'researches',
',',
'and',
'when',
'a',
'date',
'was',
'worked',
'out',
'by',
'his',
'investigators',
'for',
'the',
'accession',
'of',
'Sargon',
'I',
'he',
'commemorated',
'the',
'fact',
'by',
'inscriptions',
'.',
' ',
'But',
'there',
'were',
'many',
'signs',
'of',
'disunion',
'in',
'his',
'empire',
',',
'and',
'he',
'sought',
'to',
'centralize',
'it',
'by',
'bringing',
'a',
'number',
'of',
'the',
'various',
'local',
'gods',
'to',
'Babylon',
'and',
'setting',
'up',
'temples',
'to',
'them',
'there',
'.',
' ',
'This',
'device',
'was',
'to',
'be',
'practised',
'quite',
'successfully',
'by',
'the',
'Romans',
'in',
'later',
'times',
',',
'but',
'in',
'Babylon',
'it',
'roused',
'the',
'jealousy',
'of',
'the',
'powerful',
'priesthood',
'of',
'Bel',
'Marduk',
',',
'the',
'dominant',
'god',
'of',
'the',
'Babylonians',
'.',
' ',
'They',
'cast',
'about',
'for',
'a',
'possible',
'alternative',
'to',
'Nabonidus',
'and',
'found',
'it',
'in',
'Cyrus',
'the',
'Persian',
',',
'the',
'ruler',
'of',
'the',
'adjacent',
'Median',
'Empire',
'.',
' ',
'Cyrus',
'had',
'already',
'distinguished',
'himself',
'by',
'conquering',
'Croesus',
',',
'the',
'rich',
'king',
'of',
'Lydia',
'in',
'Eastern',
'Asia',
'Minor',
'.',
' ',
'{',
'112',
'}',
'He',
'came',
'up',
'against',
'Babylon',
',',
'there',
'was',
'a',
'battle',
'outside',
'the',
'walls',
',',
'and',
'the',
'gates',
'of',
'the',
'city',
'were',
'opened',
'to',
'him',
'(',
'538',
'B.C.',
')',
'.',
' ',
'His',
'soldiers',
'entered',
'the',
'city',
'without',
'fighting',
'.',
' ',
'The',
'crown',
'prince',
'Belshazzar',
',',
'the',
'son',
'of',
'Nabonidus',
',',
'was',
'feasting',
',',
'the',
'Bible',
'relates',
',',
'when',
'a',
'hand',
'appeared',
'and',
'wrote',
'in',
'letters',
'of',
'fire',
'upon',
'the',
'wall',
'these',
'mystical',
'words',
':',
'_',
'"',
'Mene',
',',
'Mene',
',',
'Tekel',
',',
'Upharsin',
',',
'"',
'_',
'which',
'was',
'interpreted',
'by',
'the',
'prophet',
'Daniel',
',',
'whom',
'he',
'summoned',
'to',
'read',
'the',
'riddle',
',',
'as',
'"',
'God',
'has',
'numbered',
'thy',
'kingdom',
'and',
'finished',
'it',
';',
'thou',
'art',
'weighed',
'in',
'the',
'balance',
'and',
'found',
'wanting',
'and',
'thy',
'kingdom',
'is',
'given',
'to',
'the',
'Medes',
'and',
'Persians',
'.',
'"',
' ',
'Possibly',
'the',
'priests',
'of',
'Bel',
'Marduk',
'knew',
'something',
'about',
'that',
'writing',
'on',
'the',
'wall',
'.',
' ',
'Belshazzar',
'was',
'killed',
'that',
'night',
',',
'says',
'the',
'Bible',
'.',
'Nabonidus',
'was',
'taken',
'prisoner',
',',
'and',
'the',
'occupation',
'of',
'the',
'city',
'was',
'so',
'peaceful',
'that',
'the',
'services',
'of',
'Bel',
'Marduk',
'continued',
'without',
'intermission',
'.']
Filtering tokens#
After extracting the tokens, we can use some attributes and methods provided by spaCy, along with some vanilla Python methods, to filter the tokens to just the types we’re interested in analyzing.
# If we're only interested in analyzing word tokens, we can remove punctuation:
for token in doc[:20]:
    print(f'TOKEN: {token.text:15} IS_PUNCTUATION: {token.is_punct}')

no_punct = [token for token in doc if not token.is_punct]
no_punct[:20]
TOKEN: Even IS_PUNCTUATION: False
TOKEN: under IS_PUNCTUATION: False
TOKEN: the IS_PUNCTUATION: False
TOKEN: Assyrian IS_PUNCTUATION: False
TOKEN: monarchs IS_PUNCTUATION: False
TOKEN: and IS_PUNCTUATION: False
TOKEN: especially IS_PUNCTUATION: False
TOKEN: under IS_PUNCTUATION: False
TOKEN: Sardanapalus IS_PUNCTUATION: False
TOKEN: , IS_PUNCTUATION: True
TOKEN: Babylon IS_PUNCTUATION: False
TOKEN: had IS_PUNCTUATION: False
TOKEN: been IS_PUNCTUATION: False
TOKEN: a IS_PUNCTUATION: False
TOKEN: scene IS_PUNCTUATION: False
TOKEN: of IS_PUNCTUATION: False
TOKEN: great IS_PUNCTUATION: False
TOKEN: intellectual IS_PUNCTUATION: False
TOKEN: activity IS_PUNCTUATION: False
TOKEN: . IS_PUNCTUATION: True
[Even,
under,
the,
Assyrian,
monarchs,
and,
especially,
under,
Sardanapalus,
Babylon,
had,
been,
a,
scene,
of,
great,
intellectual,
activity,
,
111]
# There are still some space tokens; here's how to remove spaces and newlines:
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[:30]:
    print(token.text)
Even
under
the
Assyrian
monarchs
and
especially
under
Sardanapalus
Babylon
had
been
a
scene
of
great
intellectual
activity
111
Sardanapalus
though
an
Assyrian
had
been
quite
Babylon
ized
He
made
# Let's say we also want to remove numbers and lowercase everything that remains
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]
['even',
'under',
'the',
'assyrian',
'monarchs',
'and',
'especially',
'under',
'sardanapalus',
'babylon',
'had',
'been',
'a',
'scene',
'of',
'great',
'intellectual',
'activity',
'sardanapalus',
'though',
'an',
'assyrian',
'had',
'been',
'quite',
'babylon',
'ized',
'he',
'made',
'a']
One additional common filtering step is to remove stopwords. In theory, stopwords can be any words we’re not interested in analyzing, but in practice, they are often the most common words in a language that do not carry much semantic information (e.g., articles, conjunctions).
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]
['assyrian',
'monarchs',
'especially',
'sardanapalus',
'babylon',
'scene',
'great',
'intellectual',
'activity',
'sardanapalus',
'assyrian',
'babylon',
'ized',
'library',
'library',
'paper',
'clay',
'tablets',
'writing',
'mesopotamia',
'early',
'sumerian',
'days',
'collection',
'unearthed',
'precious',
'store',
'historical',
'material',
'world']
We’ve used spaCy’s built-in stopword list; membership in this list determines the property is_stop for each token. It’s good practice to be wary of any built-in stopword list, however: there’s a good chance you will want to filter out some words that aren’t on the list and to keep some that are, especially if you’re working with specialized texts.
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["assyrian", "babylon"]
custom_clean = [token for token in clean if token not in custom_stopwords]
custom_clean
['monarchs',
'especially',
'sardanapalus',
'scene',
'great',
'intellectual',
'activity',
'sardanapalus',
'ized',
'library',
'library',
'paper',
'clay',
'tablets',
'writing',
'mesopotamia',
'early',
'sumerian',
'days',
'collection',
'unearthed',
'precious',
'store',
'historical',
'material',
'world',
'chaldean',
'line',
'babylonian',
'monarchs',
'nabonidus',
'keener',
'literary',
'tastes',
'patronized',
'antiquarian',
'researches',
'date',
'worked',
'investigators',
'accession',
'sargon',
'commemorated',
'fact',
'inscriptions',
'signs',
'disunion',
'empire',
'sought',
'centralize',
'bringing',
'number',
'local',
'gods',
'setting',
'temples',
'device',
'practised',
'successfully',
'romans',
'later',
'times',
'roused',
'jealousy',
'powerful',
'priesthood',
'bel',
'marduk',
'dominant',
'god',
'babylonians',
'cast',
'possible',
'alternative',
'nabonidus',
'found',
'cyrus',
'persian',
'ruler',
'adjacent',
'median',
'empire',
'cyrus',
'distinguished',
'conquering',
'croesus',
'rich',
'king',
'lydia',
'eastern',
'asia',
'minor',
'came',
'battle',
'outside',
'walls',
'gates',
'city',
'opened',
'soldiers',
'entered',
'city',
'fighting',
'crown',
'prince',
'belshazzar',
'son',
'nabonidus',
'feasting',
'bible',
'relates',
'hand',
'appeared',
'wrote',
'letters',
'fire',
'wall',
'mystical',
'words',
'mene',
'mene',
'tekel',
'upharsin',
'interpreted',
'prophet',
'daniel',
'summoned',
'read',
'riddle',
'god',
'numbered',
'thy',
'kingdom',
'finished',
'thou',
'art',
'weighed',
'balance',
'found',
'wanting',
'thy',
'kingdom',
'given',
'medes',
'persians',
'possibly',
'priests',
'bel',
'marduk',
'knew',
'writing',
'wall',
'belshazzar',
'killed',
'night',
'says',
'bible',
'nabonidus',
'taken',
'prisoner',
'occupation',
'city',
'peaceful',
'services',
'bel',
'marduk',
'continued',
'intermission']
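As an aside, spaCy’s built-in stopword list can itself be edited in place, so that is_stop reflects your choices during tokenization. Here is a minimal sketch; the particular words are purely illustrative.
# Add a word to spaCy's stopword list and update its cached lexeme flag
nlp.Defaults.stop_words.add("ized")
nlp.vocab["ized"].is_stop = True

# Remove a word from the list (a no-op if it isn't there)
nlp.Defaults.stop_words.discard("various")
nlp.vocab["various"].is_stop = False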
At this point, we have a list of lower-cased tokens that doesn’t contain punctuation, white-space, numbers, or stopwords. Depending on your analytical goals, you may or may not want to do this much cleaning, but hopefully you have a greater appreciation for the kinds of cleaning that can be done with spaCy.
Counting tokens#
Now that we’ve used spaCy to tokenize and clean our text, we can begin one of the most fundamental text analysis tasks: counting words!
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))
Number of tokens in document: 442
Number of tokens in cleaned document: 175
Number of unique tokens in cleaned document: 147
from collections import Counter
?Counter
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)
[('the', 36),
(',', 26),
('of', 20),
('.', 16),
(' ', 14),
('and', 13),
('in', 9),
('a', 8),
('was', 8),
('to', 8),
('he', 6),
('by', 6),
('babylon', 5),
('had', 4),
('that', 4),
('his', 4),
('nabonidus', 4),
('it', 4),
('"', 4),
('been', 3)]
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)
[('babylon', 5),
('nabonidus', 4),
('bel', 3),
('marduk', 3),
('city', 3),
('assyrian', 2),
('monarchs', 2),
('sardanapalus', 2),
('library', 2),
('writing', 2),
('empire', 2),
('god', 2),
('found', 2),
('cyrus', 2),
('belshazzar', 2),
('bible', 2),
('wall', 2),
('mene', 2),
('thy', 2),
('kingdom', 2)]
Part-of-speech tagging#
Let’s consider some other aspects of the text that spaCy exposes for us. One of the most noteworthy features is part-of-speech tagging.
# spaCy provides two levels of POS tagging. Here's the more general level.
for token in doc[:30]:
    print(token.text, token.pos_)
Even ADV
under ADP
the DET
Assyrian ADJ
monarchs NOUN
and CCONJ
especially ADV
under ADP
Sardanapalus PROPN
, PUNCT
Babylon PROPN
had AUX
been AUX
a DET
scene NOUN
of ADP
great ADJ
intellectual ADJ
activity NOUN
. PUNCT
SPACE
{ PUNCT
111 NUM
} PUNCT
Sardanapalus PROPN
, PUNCT
though SCONJ
an DET
Assyrian PROPN
, PUNCT
# spaCy also provides the more specific Penn Treebank tags.
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:30]:
    print(token.text, token.tag_)
Even RB
under IN
the DT
Assyrian JJ
monarchs NNS
and CC
especially RB
under IN
Sardanapalus NNP
, ,
Babylon NNP
had VBD
been VBN
a DT
scene NN
of IN
great JJ
intellectual JJ
activity NN
. .
_SP
{ -LRB-
111 CD
} -RRB-
Sardanapalus NNP
, ,
though IN
an DT
Assyrian NNP
, ,
We can count the occurrences of each part of speech in the text, which may be useful for document classification (fiction may have different proportions of parts of speech relative to nonfiction, for example) or stylistic analysis (more on that later).
nouns = [token for token in doc if token.pos_ == "NOUN"]
verbs = [token for token in doc if token.pos_ == "VERB"]
proper_nouns = [token for token in doc if token.pos_ == "PROPN"]
adjectives = [token for token in doc if token.pos_ == "ADJ"]
adverbs = [token for token in doc if token.pos_ == "ADV"]
pos_counts = {
    "nouns": len(nouns),
    "verbs": len(verbs),
    "proper_nouns": len(proper_nouns),
    "adjectives": len(adjectives),
    "adverbs": len(adverbs),
}
pos_counts
{'nouns': 66, 'verbs': 43, 'proper_nouns': 45, 'adjectives': 24, 'adverbs': 12}
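The same tallies, plus those for every other tag, can be collected in a single pass with a Counter:
from collections import Counter

# Count every coarse-grained POS tag in one pass over the doc
Counter(token.pos_ for token in doc).most_common()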
spaCy performs morphosyntactic analysis of individual tokens, including lemmatizing inflected or conjugated forms to their base (dictionary) forms. Reducing words to their lemmas can make a large corpus more manageable, and it is generally more effective than stemming (trimming the inflected/conjugated endings of words until just the base portion remains). It should only be done, however, if the inflections are not relevant to your analysis.
for token in doc:
    if token.pos_ in ["NOUN", "VERB"] and token.orth_ != token.lemma_:
        print(f"{token.text:15} {token.lemma_}")
monarchs monarch
ized ize
made make
tablets tablet
used use
writing write
days day
unearthed unearth
monarchs monarch
had have
tastes taste
patronized patronize
researches research
worked work
investigators investigator
commemorated commemorate
inscriptions inscription
were be
signs sign
sought seek
bringing bring
gods god
setting set
temples temple
practised practise
times time
roused rouse
found find
distinguished distinguish
conquering conquer
came come
was be
walls wall
gates gate
opened open
soldiers soldier
entered enter
fighting fight
feasting feast
relates relate
appeared appear
wrote write
letters letter
words word
interpreted interpret
summoned summon
numbered number
finished finish
weighed weigh
found find
wanting want
given give
priests priest
knew know
killed kill
says say
taken take
services service
continued continue
Parsing#
spaCy’s trained models also provide full dependency parsing, tagging word tokens with their syntactic relations to other tokens. This functionality drives spaCy’s built-in sentencization as well.
We won’t spend much time exploring this feature, but it’s useful to see how it enables the extraction of multi-word “noun chunks” from the text. Note also that textacy (discussed below) has a built-in function to extract subject-verb-object triples from sentences.
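For a quick peek at what the parser produces, each token carries a dependency label and a reference to its syntactic head:
# Dependency relation and syntactic head for the first few tokens
for token in doc[:10]:
    print(f"{token.text:15} {token.dep_:10} {token.head.text}")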
for chunk in itertools.islice(doc.noun_chunks, 20):
    print(chunk.text)
the Assyrian monarchs
Sardanapalus
Babylon
a scene
great intellectual activity
{111} Sardanapalus
an Assyrian
He
a library
a library
paper
the clay tablets
that
Mesopotamia
early Sumerian days
His collection
the most precious store
historical material
the world
The last
Named-entity recognition#
spaCy’s models do a pretty good job of identifying and classifying named entities (people, places, organizations).
It is also fairly easy to customize and fine-tune these models by providing additional training data (e.g., texts with entities labeled according to the desired scheme), but that’s out of the scope of this workshop.
for ent in doc.ents:
    print(f'{ent.text:20} {ent.label_:15} {spacy.explain(ent.label_)}')
Assyrian NORP Nationalities or religious or political groups
Sardanapalus WORK_OF_ART Titles of books, songs, etc.
Babylon GPE Countries, cities, states
111 CARDINAL Numerals that do not fall under another type
Assyrian NORP Nationalities or religious or political groups
Babylon ORG Companies, agencies, institutions, etc.
Mesopotamia LOC Non-GPE locations, mountain ranges, bodies of water
early Sumerian days DATE Absolute or relative dates or periods
Chaldean NORP Nationalities or religious or political groups
Babylonian NORP Nationalities or religious or political groups
Nabonidus ORG Companies, agencies, institutions, etc.
Sargon ORG Companies, agencies, institutions, etc.
Romans NORP Nationalities or religious or political groups
Babylon GPE Countries, cities, states
Bel Marduk PERSON People, including fictional
Babylonians NORP Nationalities or religious or political groups
Nabonidus ORG Companies, agencies, institutions, etc.
Persian NORP Nationalities or religious or political groups
Croesus PERSON People, including fictional
Lydia PERSON People, including fictional
Eastern Asia Minor LOC Non-GPE locations, mountain ranges, bodies of water
112 CARDINAL Numerals that do not fall under another type
Babylon ORG Companies, agencies, institutions, etc.
538 CARDINAL Numerals that do not fall under another type
B.C. GPE Countries, cities, states
Nabonidus PERSON People, including fictional
Bible WORK_OF_ART Titles of books, songs, etc.
Mene PERSON People, including fictional
Tekel ORG Companies, agencies, institutions, etc.
Upharsin PERSON People, including fictional
Medes NORP Nationalities or religious or political groups
Persians NORP Nationalities or religious or political groups
Bel Marduk PERSON People, including fictional
that night TIME Times smaller than a day
Bible WORK_OF_ART Titles of books, songs, etc.
Nabonidus PERSON People, including fictional
Bel Marduk PERSON People, including fictional
What if we only care about geo-political entities or locations?
ent_filtered = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ["GPE", "LOC"]]
ent_filtered
[('Babylon', 'GPE'),
('Mesopotamia', 'LOC'),
('Babylon', 'GPE'),
('Eastern Asia Minor', 'LOC'),
('B.C.', 'GPE')]
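If you want frequencies rather than a flat list, a Counter over the filtered entities does the trick:
from collections import Counter

# Tally how often each place name appears in the document
Counter(ent.text for ent in doc.ents if ent.label_ in ["GPE", "LOC"]).most_common()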
Visualizing Parses#
The built-in displaCy visualizer can render the results of the named-entity recognition, as well as the dependency parser.
from spacy import displacy

?displacy.render
displacy.render(doc, style="dep", jupyter=True)
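The entity view works the same way; just switch the style:
# Highlight named entities inline instead of drawing dependency arcs
displacy.render(doc, style="ent", jupyter=True)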
Corpus-level analysis with textacy#
Let’s shift to thinking about a whole corpus rather than a single document. We could analyze multiple documents with spaCy and then knit the results together with some extra Python. Instead, though, we’re going to take advantage of textacy, a library built on spaCy that adds corpus analysis features.
For reference, here’s the online documentation for textacy.
Generating corpora#
We’ll use some of the data that is included in textacy as our corpus. It is certainly possible to build your own corpus by importing data from files in plain text, XML, JSON, CSV or other formats, but working with one of textacy’s “pre-cooked” datasets simplifies things a bit.
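For reference, here is a minimal sketch of building a corpus from your own plain-text files; the directory name is hypothetical, and textacy.Corpus is introduced just below.
import pathlib

# Hypothetical folder of plain-text files, one document per file
texts = (p.read_text() for p in pathlib.Path("my_texts").glob("*.txt"))
my_corpus = textacy.Corpus(nlp, data=texts)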
import textacy.datasets
# We'll work with a dataset of ~8,400 ("almost all") U.S. Supreme Court
# decisions from November 1946 through June 2016
# https://github.com/bdewilde/textacy-data/releases/tag/supreme_court_py3_v1.0
data = textacy.datasets.SupremeCourt()
data.download()
The documentation indicates the metadata that is available with each text.
# help(textacy.datasets.supreme_court)
textacy is organized around the concept of a corpus, whereas spaCy focuses on single documents. A textacy corpus is instantiated with a spaCy language model (we’re using the one from the first half of this workshop), which is used to apply its analytical pipeline to each text in the corpus; the corpus is then populated with a set of records, i.e., texts along with their metadata (if metadata is available).
Let’s go ahead and define a set of records (texts with metadata) that we’ll then add to our corpus. To keep the processing time of the data set a bit more manageable, we’ll just look at a set of court decisions from a short span of time.
from IPython.display import display, HTML, clear_output
corpus = textacy.Corpus(nlp)
# There are 79 docs in this range -- they'll take a minute or two to process
recent_decisions = data.records(date_range=('2010-01-01', '2010-12-31'))
for i, record in enumerate(recent_decisions):
    clear_output(wait=True)
    display(HTML(f"<pre>{i+1:>2}/79: Adding {record[1]['case_name']}</pre>"))
    corpus.add_record(record)
# If the three lines above are taking too long to process all 79 docs,
# comment them out and uncomment the two lines below to download and import
# a preprocessed version of the corpus
#!wget https://github.com/sul-cidr/Workshops/raw/master/Text_Analysis_with_Python/data/scotus_2010.bin.gz
#corpus = textacy.Corpus.load(nlp, "scotus_2010.bin.gz")
print(len(corpus))
[doc._.preview for doc in corpus[:5]]
79
['Doc(13007 tokens: "Respondent New York City taxes the possession o...")',
'Doc(72325 tokens: "As amended by §203 of the Bipartisan Campaign R...")',
'Doc(8333 tokens: "Under 28 U. S. C. §2254(d)(2), a federal court ...")',
'Doc(9947 tokens: "The Illegal Immigration Reform and Immigrant Re...")',
'Doc(5508 tokens: "Per Curiam. From beginning to end, judicial pr...")']
We can see that each item in the corpus is a Doc: a processed spaCy output document, with all of the extracted features. textacy provides some capacity to work with those features via its API, and also exposes new document-level features, such as ngrams and algorithms to determine a document’s readability level, among others.
We can filter this corpus based on metadata attributes.
corpus[9]._.meta
{'issue': '90520',
'issue_area': 9,
'n_min_votes': 0,
'case_name': 'THE HERTZ CORPORATION v. MELINDA FRIEND et al.',
'maj_opinion_author': 110,
'decision_date': '2010-02-23',
'decision_direction': 'liberal',
'n_maj_votes': 9,
'us_cite_id': '559 U.S. 77',
'argument_date': '2009-11-10'}
# Here we'll find all the cases where the number of justices voting in the majority was greater than 6.
supermajorities = [doc for doc in corpus.get(lambda doc: doc._.meta["n_maj_votes"] > 6)]
len(supermajorities)
53
supermajorities[0]._.preview
'Doc(8333 tokens: "Under 28 U. S. C. §2254(d)(2), a federal court ...")'
Finding important words in the corpus#
print("number of documents: ", corpus.n_docs)
print("number of sentences: ", corpus.n_sents)
print("number of tokens: ", corpus.n_tokens)
number of documents: 79
number of sentences: 42470
number of tokens: 1189205
?corpus.word_counts
corpus.word_counts(by="orth_", filter_stops=False, filter_punct=False, filter_nums=False)
{'Respondent': 250,
'New': 413,
'York': 284,
'City': 340,
'taxes': 56,
'the': 56729,
'possession': 216,
'of': 30659,
'cigarettes': 23,
'.': 44541,
'Petitioner': 246,
'Hemi': 82,
'Group': 37,
',': 72059,
'based': 526,
'in': 14744,
'Mexico': 16,
'sells': 6,
'online': 49,
'to': 25427,
'residents': 25,
'Neither': 74,
'state': 1920,
'nor': 242,
'city': 48,
'law': 1899,
'requires': 352,
'out': 548,
'-': 10949,
'sellers': 13,
'such': 1571,
'as': 5683,
'charge': 102,
'collect': 35,
'or': 5809,
'remit': 7,
"'s": 9728,
'tax': 175,
';': 4864,
'instead': 175,
'must': 1197,
'recover': 50,
'its': 2590,
'on': 5214,
'sales': 48,
'directly': 134,
'from': 2842,
'purchasers': 16,
'But': 1066,
'Jenkins': 51,
'Act': 1278,
'15': 449,
'U.': 5900,
'S.': 6072,
'C.': 1321,
'§': 6375,
'375': 37,
'378': 36,
'submit': 57,
'customer': 9,
'information': 310,
'States': 2256,
'into': 572,
'which': 2646,
'they': 1302,
'ship': 22,
'and': 15113,
'State': 1262,
'has': 2191,
'agreed': 176,
'forward': 51,
'that': 18373,
'That': 507,
'helps': 26,
'track': 19,
'down': 115,
'cigarette': 14,
'who': 1201,
'do': 1015,
'not': 8499,
'pay': 124,
'their': 1507,
'Against': 27,
'backdrop': 10,
'filed': 402,
'this': 3792,
'lawsuit': 51,
'under': 2030,
'Racketeer': 6,
'Influenced': 6,
'Corrupt': 6,
'Organizations': 11,
'(': 14766,
'RICO': 99,
')': 18634,
'alleging': 33,
'failure': 216,
'file': 195,
'reports': 84,
'with': 3669,
'constituted': 41,
'mail': 76,
'wire': 37,
'fraud': 328,
'are': 2675,
'defined': 144,
'"': 30807,
'racketeering': 5,
'activit[ies': 1,
']': 3140,
'18': 499,
'1961(1': 4,
'subject': 502,
'enforcement': 148,
'civil': 198,
'1964(c': 12,
'The': 5952,
'District': 1052,
'Court': 5729,
'dismissed': 61,
'claims': 611,
'but': 1669,
'Second': 481,
'Circuit': 695,
'vacated': 47,
'judgment': 1061,
'remanded': 129,
'Among': 37,
'other': 1491,
'things': 81,
'Appeals': 794,
'held': 625,
'asserted': 113,
'injury': 174,
'—': 1921,
'lost': 60,
'revenue': 51,
'came': 32,
'about': 767,
'by': 4551,
'reason': 416,
'predicate': 75,
'frauds': 23,
'It': 1106,
'accordingly': 30,
'determined': 150,
'had': 1656,
'stated': 232,
'a': 18294,
'valid': 136,
'claim': 938,
'Held': 68,
':': 1401,
'is': 9354,
'reversed': 181,
'case': 1985,
'541': 78,
'F.': 1407,
'3d': 1008,
'425': 32,
'Chief': 146,
'Justice': 872,
'Roberts': 124,
'delivered': 157,
'opinion': 1212,
'part': 695,
'concluding': 116,
'because': 1303,
'can': 1523,
'show': 217,
'it': 4838,
'alleged': 202,
'violation': 393,
'Pp': 393,
'5': 623,
'To': 306,
'establish': 166,
'an': 4200,
'plaintiff': 258,
'offense': 354,
'only': 1770,
'was': 3521,
"'": 4309,
'for': 7978,
'cause': 375,
'his': 1845,
'proximate': 26,
'well': 492,
'Holmes': 34,
'v.': 4345,
'Securities': 57,
'Investor': 6,
'Protection': 60,
'Corporation': 43,
'503': 63,
'258': 22,
'268': 37,
'Proximate': 2,
'purposes': 294,
'should': 1012,
'be': 4659,
'evaluated': 12,
'light': 286,
'common': 264,
'foundations': 7,
'thus': 437,
'some': 736,
'direct': 161,
'relation': 65,
'between': 624,
'injurious': 5,
'conduct': 686,
'Ibid': 461,
'A': 857,
'link': 22,
'too': 179,
'remote': 23,
'purely': 34,
'contingent': 12,
'indirec[t': 2,
'insufficient': 68,
'Id.': 1254,
'at': 9800,
'271': 14,
'274': 39,
'causal': 12,
'theory': 205,
'satisfy': 106,
'relationship': 136,
'requirement': 376,
'Indeed': 196,
'here': 652,
'far': 204,
'more': 1214,
'attenuated': 12,
'than': 1270,
'one': 1262,
'rejected': 248,
'According': 80,
'committed': 136,
'selling': 17,
'failing': 51,
'required': 397,
'Without': 38,
'could': 1033,
'pass': 41,
'even': 910,
'if': 1636,
'been': 1138,
'so': 1158,
'inclined': 8,
'Some': 67,
'customers': 53,
'legally': 61,
'obligated': 10,
'failed': 197,
'Because': 307,
'did': 1303,
'receive': 130,
'determine': 326,
'pursue': 78,
'those': 1158,
'payment': 52,
'thereby': 85,
'injured': 30,
'amount': 258,
'portion': 78,
'back': 130,
'were': 1191,
'never': 247,
'collected': 7,
'As': 716,
'reiterated': 20,
'[': 3078,
't]he': 180,
'general': 417,
'tendency': 12,
'regard': 62,
'damages': 174,
'least': 289,
'go': 94,
'beyond': 230,
'first': 718,
'step': 94,
'i': 1008,
'd.': 975,
'272': 29,
'applies': 365,
'full': 243,
'force': 233,
'inquiries': 17,
'e.g.': 879,
'ibid': 207,
'causation': 23,
'move': 26,
'suffers': 15,
'same': 765,
'defect': 17,
'Anza': 20,
'Ideal': 19,
'Steel': 25,
'Supply': 10,
'Corp.': 207,
'547': 76,
'451': 41,
'458': 59,
'461': 50,
'where': 588,
'causing': 13,
'harm': 146,
'distinct': 73,
'giving': 86,
'rise': 60,
'see': 1707,
'disconnect': 2,
'sharper': 2,
'In': 1893,
'party': 520,
'both': 465,
'engaged': 65,
'harmful': 22,
'fraudulent': 38,
'act': 231,
'Here': 107,
'liability': 247,
'rests': 65,
'just': 333,
'separate': 209,
'actions': 258,
'carried': 38,
'parties': 678,
'extend': 71,
'situations': 32,
'defendant': 613,
'third': 159,
'made': 653,
'easier': 33,
'fourth': 41,
'taxpayer': 11,
'taxpayers': 13,
'caused': 112,
'place': 389,
'decided': 169,
'Put': 8,
'simply': 305,
'obligation': 81,
'before': 826,
'stretched': 4,
'chain': 26,
'declines': 15,
'today': 275,
'See': 3037,
'460': 41,
'9': 509,
'b': 120,
'attempts': 59,
'avoid': 218,
'conclusion': 453,
'characterizing': 9,
'merely': 201,
'systematic': 37,
'scheme': 133,
'defraud': 39,
'escape': 26,
'embraced': 14,
'all': 1495,
'indirectly': 20,
'harmed': 15,
'precedent': 165,
'would': 2822,
'become': 106,
'mere': 81,
'pleading': 45,
'rule': 715,
'makes': 276,
'clear': 455,
'compensable': 5,
'flowing': 5,
'...': 1600,
'necessarily': 133,
'acts': 187,
'supra': 994,
'457': 31,
'led': 63,
'injuries': 28,
'also': 1734,
'errs': 17,
'relying': 43,
'Bridge': 14,
'Phoenix': 3,
'Bond': 5,
'&': 499,
'Indemnity': 8,
'Co.': 462,
'553': 61,
'_': 2073,
'There': 268,
'plaintiffs': 336,
'straightforward': 45,
'involved': 144,
'easily': 77,
'identifiable': 4,
'connection': 70,
'issue': 788,
'there': 795,
'petitioners': 370,
'misrepresentations': 16,
'no': 1984,
'independent': 312,
'factors': 215,
'account[ed': 3,
'anything': 112,
'Multiple': 4,
'steps': 56,
'And': 721,
'contrast': 109,
'certainly': 79,
'10': 567,
'14': 413,
'J.': 1306,
'Scalia': 334,
'Thomas': 223,
'Alito': 154,
'JJ': 104,
'joined': 184,
'Ginsburg': 118,
'concurring': 560,
'Breyer': 176,
'dissenting': 483,
'Stevens': 381,
'Kennedy': 228,
'Sotomayor': 120,
'took': 146,
'consideration': 157,
'decision': 1001,
'HEMI': 3,
'GROUP': 3,
'LLC': 25,
'KAI': 3,
'GACHUPIN': 3,
'PETITIONERS': 68,
'CITY': 12,
'OF': 57,
'NEW': 8,
'YORK': 6,
'writ': 286,
'certiorari': 374,
'united': 168,
'states': 313,
'court': 2667,
'appeals': 221,
'second': 420,
'circuit': 174,
'January': 58,
'25': 197,
'2010': 352,
'seldom': 5,
'own': 406,
'Federal': 681,
'however': 606,
'vendors': 6,
'argues': 158,
'constitutes': 72,
'lose': 25,
'tens': 7,
'millions': 15,
'dollars': 18,
'unrecovered': 1,
'we': 1960,
'hold': 205,
'We': 899,
'therefore': 414,
'reverse': 63,
'contrary': 247,
'I': 1534,
'This': 818,
'arises': 26,
'motion': 320,
'dismiss': 90,
'accept': 144,
'true': 206,
'factual': 154,
'allegations': 57,
'amended': 139,
'complaint': 177,
'Leatherman': 1,
'Tarrant': 1,
'County': 123,
'Narcotics': 5,
'Intelligence': 4,
'Coordination': 3,
'Unit': 2,
'507': 43,
'163': 29,
'164': 30,
'1993': 96,
'authorizes': 79,
'impose': 182,
'N.': 260,
'Y.': 84,
'Unconsol': 2,
'Law': 303,
'Ann': 196,
'9436(1': 2,
'West': 131,
'Supp': 333,
'2009': 563,
'Under': 214,
'authority': 532,
'levied': 3,
'$': 227,
'1.50': 1,
'per': 215,
'pack': 3,
'each': 307,
'standard': 410,
'possessed': 24,
'within': 601,
'sale': 67,
'use': 505,
'Admin': 11,
'Code': 355,
'11': 518,
'1302(a': 1,
'2008': 437,
'Record': 81,
'A1016': 1,
'When': 243,
'buy': 10,
'seller': 12,
'responsible': 68,
'charging': 24,
'collecting': 47,
'remitting': 1,
'Tax': 17,
'471(2': 1,
'Out': 1,
'Smokes-Spirits.com': 3,
'Inc.': 546,
'432': 26,
'433': 20,
'CA2': 124,
'Instead': 129,
'recovering': 2,
'sold': 31,
'outside': 167,
'difficult': 148,
'often': 211,
'reluctant': 14,
'tough': 2,
'One': 98,
'way': 281,
'gather': 8,
'assist': 16,
'through': 426,
'63': 65,
'Stat': 412,
'884': 6,
'69': 34,
'627': 10,
'register': 74,
'report': 114,
'tobacco': 12,
'administrators': 15,
'listing': 41,
'name': 80,
'address': 246,
'quantity': 12,
'purchased': 18,
'have': 3351,
'executed': 25,
'agreement': 420,
'undertake': 32,
'cooperate': 12,
'fully': 87,
'keep': 250,
'promptly': 19,
'informed': 95,
'reference': 119,
'any': 2359,
'person': 421,
'transaction': 36,
'including': 397,
'i]nformation': 1,
'obtained': 83,
'may': 1890,
'result': 353,
'additional': 214,
'provided': 261,
'disclosure': 280,
'permissible': 48,
'existing': 105,
'laws': 397,
'agreements': 70,
'A1003': 1,
'asserts': 84,
'forwards': 1,
'A998': 1,
'Amended': 8,
'Compl': 5,
'¶54': 1,
'¶¶58': 2,
'59': 49,
'company': 103,
'does': 1700,
'alleges': 24,
'cost': 43,
'hundreds': 20,
'year': 356,
'excise': 5,
'A996': 3,
'Based': 35,
'federal': 1542,
'B': 206,
'provides': 363,
'private': 339,
'action': 606,
'a]ny': 17,
'business': 445,
'property': 652,
'section': 200,
'1962': 18,
'chapter': 48,
'Section': 267,
'turn': 125,
'contains': 75,
'criminal': 495,
'provisions': 483,
'Specifically': 22,
'1962(c': 2,
'invokes': 22,
'unlawful': 90,
'employed': 52,
'associated': 35,
'enterprise': 19,
'activities': 218,
'affect': 77,
'interstate': 88,
'commerce': 83,
'participate': 35,
'affairs': 31,
'pattern': 22,
'activity': 154,
'R]acketeering': 1,
'include': 214,
'number': 220,
'called': 103,
'two': 757,
'identifying': 41,
'constitute': 120,
'offenses': 123,
'A980': 4,
'Invoking': 5,
'suffered': 53,
'form': 197,
'terms--"by': 1,
'contest': 21,
'characterization': 26,
'violations': 109,
'actionable': 7,
'assume': 108,
'without': 736,
'deciding': 106,
'material': 172,
'serve': 111,
'determining': 172,
'owner': 40,
'officer': 105,
'Kai': 1,
'Gachupin': 3,
'individual': 316,
'duty': 194,
'Nexicon': 1,
'No': 393,
'03': 5,
'CV': 1,
'383': 38,
'DAB': 1,
'2006': 287,
'WL': 20,
'647716': 1,
'*': 387,
'7-*8': 1,
'SDNY': 24,
'Mar.': 48,
'formed': 60,
'7-*10': 1,
'ground': 196,
'whether': 1708,
'loss': 58,
'1964': 45,
'further': 411,
'proceedings': 325,
'established': 290,
'operated': 29,
'447': 50,
'448': 15,
'444': 47,
'445': 42,
'concluded': 310,
'440': 25,
'viable': 5,
'Judge': 102,
'Winter': 15,
'dissented': 11,
'petition': 473,
'asking': 53,
'Pet': 233,
'Cert': 205,
'i.': 13,
'granted': 316,
'556': 72,
'II': 307,
'Though': 26,
'framed': 16,
'single': 202,
'question': 966,
'raises': 43,
'issues': 125,
'First': 800,
'allegedly': 49,
'decide': 301,
'1992': 105,
'set': 342,
'forth': 168,
'addressed': 110,
'brought': 178,
'SIPC': 8,
'against': 799,
'defendants': 177,
'whom': 156,
'manipulated': 4,
'stock': 38,
'prices': 33,
'262': 20,
'263': 42,
'reimburse': 2,
'certain': 445,
'registered': 87,
'broker': 8,
'dealers': 47,
'event': 105,
'unable': 44,
'meet': 64,
'financial': 80,
'obligations': 52,
'261': 19,
'conspiracy': 59,
'manipulators': 1,
'detected': 2,
'collapsed': 5,
'insurer': 1,
'ultimately': 76,
'hook': 2,
'nearly': 73,
'13': 452,
'million': 63,
'cover': 87,
'conspirators': 6,
'phrase': 187,
'used': 324,
'explained': 261,
'Applying': 43,
'hand': 92,
'quoting': 424,
'Associated': 7,
'Gen.': 42,
'Contractors': 4,
'Cal': 71,
'Carpenters': 1,
'459': 44,
'519': 42,
'534': 55,
'1983': 137,
'Southern': 38,
'Pacific': 90,
'Darnell': 1,
'Taenzer': 1,
'Lumber': 2,
'245': 15,
'531': 44,
'533': 70,
'1918': 5,
'internal': 412,
'quotation': 395,
'marks': 398,
'omitted': 484,
'Our': 158,
'cases': 934,
'confirm': 38,
'slip': 349,
'op': 346,
'19': 240,
'us': 462,
'confirms': 28,
'indirect': 9,
'considered': 217,
'competitor': 10,
'National': 215,
'defrauded': 9,
'able': 116,
'undercut': 11,
'lower': 128,
'offered': 58,
'contended': 20,
'allowed': 105,
'attract': 11,
'expense': 17,
'Finding': 24,
'victim': 66,
'being': 186,
'recognized': 245,
'harms': 35,
'when': 1284,
'applicable': 184,
'offering': 10,
'entirely': 116,
'defrauding': 3,
'constituting': 33,
'Thus': 258,
'viewed': 48,
'point': 360,
'important': 232,
'nevertheless': 53,
'found': 348,
'distinction': 113,
'relevant': 386,
'sufficient': 207,
'defeat': 28,
'decline': 53,
'cf': 120,
'n.': 894,
'46': 97,
'finding': 240,
'antitrust': 30,
'context': 325,
'stems': 7,
'most': 431,
'persons': 247,
'victims': 59,
'highlighted': 9,
'better': 85,
'situated': 14,
'incentive': 33,
'sue': 49,
'269': 16,
'270': 21,
'seek': 211,
'recovery': 43,
'imposes': 87,
'2.75': 1,
'double': 34,
'what': 783,
'charges': 80,
'471(1': 1,
'opine': 2,
'bring': 71,
'Suffice': 3,
'say': 221,
'concrete': 29,
'incentives': 18,
'try': 31,
'accuses': 2,
'Anzas': 1,
'substantial': 213,
'money': 134,
'If': 429,
'expected': 39,
'appropriate': 227,
'remedies': 79,
'dissent': 470,
'foreseeability': 6,
'rather': 317,
'existence': 97,
'sufficiently': 94,
'find': 233,
'satisfied': 77,
'foreseeable': 23,
'consequence': 79,
'intended': 231,
'indeed': 87,
'desired': 13,
'falls': 55,
'risks': 36,
'Congress': 1679,
'sought': 242,
'prevent': 189,
'Post': 202,
'6': 594,
'line': 163,
'reasoning': 102,
'sounds': 10,
'familiar': 33,
'precisely': 50,
'argument': 541,
'lodged': 7,
'majority': 415,
'criticized': 16,
'view': 556,
'permit[ting': 1,
'evade': 9,
'consequences': 158,
'behavior': 76,
'470': 56,
'carry': 60,
'day': 202,
'asked': 146,
'revisit': 12,
'concepts': 6,
'course': 264,
'many': 355,
'shapes': 5,
'precedents': 218,
'make': 467,
'focus': 62,
'directness': 4,
'mention': 42,
'concept': 62,
'offers': 59,
'responses': 32,
'challenges': 140,
'our': 1019,
'Brief': 919,
'42': 177,
'Having': 25,
'broadly': 48,
'contends': 153,
'Otherwise': 13,
'example': 331,
'give': 215,
'competitive': 22,
'advantage': 34,
'over': 456,
'454': 52,
'455': 49,
'allegation': 16,
'circumvent': 10,
'claiming': 39,
'aim': 42,
'increase': 54,
'market': 167,
'share': 54,
'460.1': 1,
'moreover': 80,
'Sedima': 2,
'P.': 107,
'R.': 263,
'L.': 350,
'Imrex': 2,
'473': 60,
'479': 68,
'497': 28,
'1985': 74,
'statement': 264,
'went': 39,
'allege': 24,
'assertion': 65,
'legal': 414,
'very': 226,
'relies': 92,
'reaffirmed': 24,
'wrongful': 42,
'competing': 32,
'bidders': 7,
'county': 48,
'lien': 4,
'auction': 18,
'liens': 5,
'profitable': 3,
'lowest': 8,
'possible': 180,
'bid': 7,
'multiple': 48,
'low': 22,
'bidding': 4,
'percentage': 28,
'penalty': 200,
'bidder': 6,
'require': 330,
'0': 4,
'%': 135,
'awarded': 56,
'devised': 11,
'plan': 163,
'allocate': 3,
'rotational': 2,
'basis': 381,
'3': 747,
'noted': 256,
'created': 157,
'perverse': 9,
'Bidders': 1,
'addition': 121,
'themselves': 141,
'sen[t': 1,
'agents': 19,
'behalf': 110,
'obtain': 106,
'disproportionate': 29,
'prohibited': 108,
...}
def show_doc_counts(input_corpus, weighting, limit=20):
    doc_counts = input_corpus.word_doc_counts(weighting=weighting, filter_stops=True, by="orth_")
    print("\n".join(f"{a:15} {b}" for a, b in sorted(doc_counts.items(), key=lambda x: x[1], reverse=True)[:limit]))
The word_doc_counts method provides a few ways of quantifying the prevalence of individual words across the corpus: whether a word appears many times in most documents, just a few times in a few documents, many times in a few documents, or just a few times in most documents.
print("# DOCS APPEARING IN / TOTAL # DOCS", "\n", "-----------", sep="")
show_doc_counts(corpus, "freq")
print("\n", "LOG(TOTAL # DOCS / # DOCS APPEARING IN)", "\n", "-----------", sep="")
show_doc_counts(corpus, "idf")
# DOCS APPEARING IN / TOTAL # DOCS
-----------
Court 0.9873417721518988
case 0.9873417721518988
certiorari 0.9873417721518988
granted 0.9873417721518988
U. 0.9746835443037974
S. 0.9746835443037974
judgment 0.9746835443037974
v. 0.9746835443037974
decision 0.9746835443037974
court 0.9746835443037974
C. 0.9620253164556962
held 0.9620253164556962
opinion 0.9620253164556962
2009 0.9620253164556962
1 0.9620253164556962
F. 0.9493670886075949
Justice 0.9493670886075949
Id. 0.9493670886075949
e.g. 0.9493670886075949
issue 0.9493670886075949
LOG(TOTAL # DOCS / # DOCS APPEARING IN)
-----------
cigarettes 4.382026634673881
Hemi 4.382026634673881
cigarette 4.382026634673881
RICO 4.382026634673881
racketeering 4.382026634673881
activit[ies 4.382026634673881
1964(c 4.382026634673881
Proximate 4.382026634673881
indirec[t 4.382026634673881
Anza 4.382026634673881
Ideal 4.382026634673881
disconnect 4.382026634673881
sharper 4.382026634673881
Phoenix 4.382026634673881
account[ed 4.382026634673881
HEMI 4.382026634673881
GROUP 4.382026634673881
KAI 4.382026634673881
GACHUPIN 4.382026634673881
YORK 4.382026634673881
textacy provides implementations of algorithms for identifying words and phrases that are representative of a document (aka keyterm extraction).
from textacy.extract import keyterms as ke
# corpus[0].text
# Run the YAKE algorithm (Campos et al., 2018) on a given document
key_terms_yake = ke.yake(corpus[0])
key_terms_yake
[('New York City', 0.002288298045327596),
('New York State', 0.0060401030525529436),
('U. S. C.', 0.0075325125188752005),
('Jenkins Act', 0.012460549374297763),
('York City customer', 0.021988206972901634),
('RICO', 0.027329026591127712),
('Hemi Group', 0.03384924285412936),
('York City cigarette', 0.03513697406836725),
('Jenkins Act information', 0.03892293549952245),
('RICO claim', 0.042756731344701926)]
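YAKE is only one of the keyterm algorithms textacy implements; the keyterms module also includes TextRank, sCAKE, and SGRank. As a quick sketch for comparison (reusing the ke import above), here is TextRank on the same document:
# Run TextRank (Mihalcea & Tarau, 2004) on the same document for comparison;
# topn limits the output to the ten highest-scoring terms
key_terms_textrank = ke.textrank(corpus[0], topn=10)
key_terms_textrank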
Keyword in context#
Sometimes researchers find it helpful just to see a particular keyword in context.
for doc in corpus[:5]:
    print("\n", doc._.meta.get('case_name'), "\n", "-" * len(doc._.meta.get('case_name')), "\n")
    for match in textacy.extract.kwic.keyword_in_context(doc.text, "judgment"):
        print(" ".join(match).replace("\n", " "))
HEMI GROUP, LLC AND KAI GACHUPIN v. CITY OF NEW YORK, NEW YORK
--------------------------------------------------------------
ed the claims, but the Second Circuit vacated the judgment and remanded. Among other things, the Court of Ap
the City had stated a valid RICO claim. Held: The judgment is reversed, and the case is remanded. 541 F. 3d
opinion concurring in part and concurring in the judgment . Breyer, J., filed a dissenting opinion, in which
The Second Circuit vacated the District Court's judgment and remanded for further proceedings. The Court o
it. The City, therefore, has no RICO claim. The judgment of the Court of Appeals for the Second Circuit is
insburg, concurring in part and concurring in the judgment . As the Court points out, this is a case "about
he above-stated view, and I concur in the Court's judgment . HEMI GROUP, LLC and KAI GACHUPIN, PETITIONERS v.
CITIZENS UNITED v. FEDERAL ELECTION COMMISSION
----------------------------------------------
ppellee Federal Election Commission (FEC) summary judgment . Held: 1. Because the question whether §441b app
(Scalia, J., concurring in part and concurring in judgment ). We agree with that conclusion and hold that sta
t later convened to hear the cause. The resulting judgment gives rise to this appeal. Citizens United has a
m), and then granted the FEC's motion for summary judgment , App. 261a-262a. See id., at 261a ("Based on the
or opinion, we find that the [FEC] is entitled to judgment as a matter of law. See Citizen[s] United v. FEC,
onnell, supra, at 339 (Kennedy, J., concurring in judgment in part and dissenting in part). The Snowe-Jeffor
rt's later opinion, which granted the FEC summary judgment , was "[b]ased on the reasoning of [its] prior opi
62 (Scalia, J., concurring in part, concurring in judgment in part, and dissenting in part); id., at 273-275
part, concurring in result in part, concurring in judgment in part, and dissenting in part); id., at 322-338
pore over each word of a text to see if, in their judgment , it accords with the 11-factor test they have pro
er, 502 U. S., at 124 (Kennedy, J., concurring in judgment ), the quoted language from WRTL provides a suffic
toral opportunities means making and implementing judgment s about which strengths should be permitted to con
issenting); id., at 773 (White, J., concurring in judgment ). With the advent of the Internet and the decline
S. 334, 360-361 (1995) (Thomas, J., concurring in judgment ). Yet television networks and major newspapers ow
t 341-343; id., at 367 (Thomas, J., concurring in judgment ). At the founding, speech was open, comprehensive
endent expenditures; if they surrender their best judgment ; and if they put expediency before principle, the
ation's course; still others simply might suspend judgment on these points but decide to think more about is
nell, supra, at 341 (opinion of Kennedy, J.). The judgment of the District Court is reversed with respect to
ctions on corporate independent expenditures. The judgment is affirmed with respect to BCRA's disclaimer and
Thomas, JJ., concurring in part and concurring in judgment ); McConnell, 540 U. S., at 247, 264, 286 (opinion
86 (Thomas, J., concurring in part, concurring in judgment in part, and dissenting in part). These readings
3) (Scalia, J., concurring in part, concurring in judgment in part, and dissenting in part) (quoting C. Cook
U. S. 334, 360 (1995) (Thomas, J., concurring in judgment ); see also McConnell, 540 U. S., at 252-253 (opin
an affirmative answer to that question is, in my judgment , profoundly misguided. Even more misguided is the
Comm. (NRWC), and have accepted the "legislative judgment that the special characteristics of the corporate
of §203. App. 23a-24a. In its motion for summary judgment , however, Citizens United expressly abandoned its
Roberts, J., concurring in part and concurring in judgment ). Consider just three of the narrower grounds of
s of longstanding practice and Congress' reasoned judgment that certain regulations which leave "untouched f
precedents "represent respect for the legislative judgment that the special characteristics of the corporate
oach taken by the majority cannot be right, in my judgment . It disregards our constitutional history and the
erting an " 'undue influence on an officeholder's judgment ' " and from creating " 'the appearance of such in
orations). When the McConnell Court affirmed the judgment of the District Court regarding §203, we did not
ll, 540 U. S., at 306 (Kennedy, J., concurring in judgment in part and dissenting in part); see also id., at
63 (Scalia, J., concurring in part, concurring in judgment in part, and dissenting in part), a disreputable
nted where, as here, we deal with a congressional judgment that has remained essentially unchanged throughou
tisfy heightened judicial scrutiny of legislative judgment s will vary up or down with the novelty and plausi
years of bipartisan deliberation and its reasoned judgment on this basis, without first confirming that the
J., dissenting). "In the meantime, a legislative judgment that 'enough is enough' should command the greate
Congress' factual findings and its constitutional judgment : It acknowledges the validity of the interest in
O, 335 U. S., at 144 (Rutledge, J., concurring in judgment )), and this, in turn, "interferes with the 'open
he expansive protections afforded by the business judgment rule. Blair & Stout 320; see also id., at 298-315
relevance of established facts and the considered judgment s of state and federal legislatures over many deca
corporate money in politics. I would affirm the judgment of the District Court. CITIZENS UNITED, APPELLANT
3) (Thomas, J., concurring in part, concurring in judgment in part, and dissenting in part) (internal quotat
64 (Thomas, J., concurring in part, concurring in judgment in part, and dissenting in part) (quoting Nixon v
ordingly, I respectfully dissent from the Court's judgment upholding BCRA §§201 and 311. FOOTNOTESFootnote 1
al Election Commission's (FEC) motion for summary judgment , App. 261a-262a, any question about statutory val
done "on the basis of entirely subjective, ad hoc judgment s," 523 U. S., at 690, that suggested anticompetit
539 U. S., at 163-164 (Kennedy, J., concurring in judgment ). Both Courts also heard criticisms of Austin fro
concurring in part and dissenting in part). In my judgment , such limitations may be justified to the extent
"We should defer to [the legislature's] political judgment that unlimited spending threatens the integrity o
HOLLY WOOD, PETITIONER v. RICHARD F. ALLEN, COMMISSIONER, ALABAMA DEPARTMENT OF CORRECTIONS, et al.
---------------------------------------------------------------------------------------------------
de a strategic decision, but to whether counsel's judgment was reasonable, a question not before this Court.
onship to §2254(e)(1). Accordingly, we affirm the judgment of the Court of Appeals on that basis. I In 1993
umption that counsel exercised sound professional judgment , supported by ample reasons, not to present the i
rategic decision, but rather to whether counsel's judgment was reasonable — a question we do not reach. See
ons were an unreasonable exercise of professional judgment and constituted deficient performance under Stric
itself was a reasonable exercise of professional judgment under Strickland or whether the application of St
mination of the facts. Accordingly, we affirm the judgment of the Court of Appeals for the Eleventh Circuit.
ess. That was a strategic decision based on their judgment that the evidence would do more harm than good. B
resulted from inattention, not reasoned strategic judgment "); Strickland, 466 U. S., at 690-691. Moreover, "
itself was a reasonable exercise of professional judgment under Strickland or whether the application of St
AGRON KUCANA v. ERIC H. HOLDER, JR., ATTORNEY GENERAL
-----------------------------------------------------
. (1) The amicus defending the Seventh Circuit's judgment urges that regulations suffice to trigger §1252(a
laces within the no-judicial-review category "any judgment regarding the granting of relief under section 11
ed. Alito, J., filed an opinion concurring in the judgment . AGRON KUCANA, PETITIONER v. ERIC H. HOLDER,Jr.,
ubparagraph (D),[1] and regardless of whether the judgment , decision, or action is made in removal proceedin
urt shall have jurisdiction to review-- "(i) any judgment regarding the granting of relief under section 11
micus curiae, in support of the Seventh Circuit's judgment . 557 U. S. ___ (2009). Ms. Leiter has ably discha
mmigration decisions to motions for relief from a judgment under Federal Rule of Civil Procedure 60(b)). Fed
Nevertheless, in defense of the Seventh Circuit's judgment , amicus urges that regulations suffice to trigger
f for Court-Appointed Amicus Curiae in Support of Judgment Below 15, 17 (citing, inter alia, Florida Dept. o
laces within the no-judicial-review category "any judgment regarding the granting of relief under section 11
To the clause (i) enumeration of administrative judgment s that are insulated from judicial review, Congres
dicial review. * * * For the reasons stated, the judgment of the United States Court of Appeals for the Sev
nuary 20, 2010] Justice Alito, concurring in the judgment . I agree that the Court of Appeals had jurisdict
f for Court-Appointed Amicus Curiae in Support of Judgment Below 41-42. Amicus' argument is ingenious but u
f for Court-Appointed Amicus Curiae in Support of Judgment Below 19, n. 8 (quoting §1229a(c)(7)(B)). One can
f for Court-Appointed Amicus Curiae in Support of Judgment Below. In every one of those examples, Congress e
f for Court-Appointed Amicus Curiae in Support of Judgment Below 21-23. But §1252(a)(2)(B)(ii) does not say
Congress want to exclude review for discretionary judgment s by the Attorney General that are recited explici
y in the statute, but provide judicial review for judgment s that are just as lawfully discretionary because
f for Court-Appointed Amicus Curiae in Support of Judgment Below 32-34. The report states that §1252(a)(2)(B
MARCUS A. WELLONS v. HILTON HALL, WARDEN
----------------------------------------
rd for an order granting certiorari, vacating the judgment below, and remanding the case (GVR) remains as it
ave to proceed in forma pauperis are granted. The judgment is vacated, and the case is remanded to the Eleve
nts Wellons' petition for certiorari, vacates the judgment of the Eleventh Circuit, and remands ("GVRs") in
erse or set the case for argument; otherwise, the judgment below must stand. The same is true if (as the Cou
rits question. If they erred in that regard their judgment should be reversed rather than remanded "in light
we are, to vacate and send back their authorized judgment s for inconsequential imperfection of opinion — as
authority or development that casts doubt on the judgment of the court below. What the Court has done — usi
t of 1996 (AEDPA) to the "Georgia Supreme Court's judgment as to the substance and effect of the ex parte co
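The keyword_in_context function also takes optional arguments to tune the display; a small sketch, assuming the window_width (characters of context per side) and ignore_case parameters of recent textacy versions:
# Show wider context around each match in the first decision,
# and match the keyword case-sensitively
for match in textacy.extract.kwic.keyword_in_context(corpus[0].text, "judgment", window_width=75, ignore_case=False):
    print(" ".join(match).replace("\n", " "))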
Vectorization#
Let’s continue with corpus-level analysis by taking advantage of textacy’s Vectorizer class, which wraps functionality from scikit-learn to count the prevalence of tokens in each document of the corpus and, if desired, to apply weights to those counts. We could work directly in scikit-learn, but it reduces mental overhead to learn one library and be able to do a great deal with it.
We’ll create a vectorizer, sticking with the normal term frequency defaults but discarding words that appear in fewer than 3 documents or more than 95% of documents. We’ll also limit our features to the top 500 words according to document frequency. This means our feature set, or columns, will have a higher degree of representation across the corpus. We could further scale these counts according to document frequency (or inverse document frequency) weights, or normalize the weights so that they add up to 1 for each document row (L1 norm), and so on.
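For reference, a weighted variant might be configured as in the following sketch; the tf_type, idf_type, and norm parameter values are assumptions about textacy’s Vectorizer API rather than workshop code, and below we proceed with plain counts:
import textacy.representations

# A tf-idf weighted, L2-normalized alternative (illustrative settings)
tfidf_vectorizer = textacy.representations.Vectorizer(
    tf_type="linear",    # raw term frequency
    idf_type="smooth",   # smoothed inverse document frequency
    norm="l2",           # scale each document row to unit length
    min_df=3, max_df=0.95, max_n_terms=500,
)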
import textacy.representations

vectorizer = textacy.representations.Vectorizer(min_df=3, max_df=.95, max_n_terms=500)
tokenized_corpus = [
    [token.orth_ for token in textacy.extract.words(doc, filter_nums=True, filter_stops=True, filter_punct=True)]
    for doc in corpus
]
dtm = vectorizer.fit_transform(tokenized_corpus)
dtm
<79x500 sparse matrix of type '<class 'numpy.int32'>'
with 22870 stored elements in Compressed Sparse Row format>
We now have a matrix representation of our corpus, where rows are documents and columns (or features) are words from the corpus. The value at any given point is the number of times that word appears in that document. Once we have a document-term matrix, we can do several things with it within textacy itself, and we can also pass it into algorithms from scikit-learn or other libraries.
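Because the document-term matrix is a standard SciPy sparse matrix, it can be passed straight to scikit-learn estimators. As one illustrative sketch (not part of the workshop), here is how the 500 term columns could be reduced to 10 latent dimensions with truncated SVD:
from sklearn.decomposition import TruncatedSVD

# Project the sparse document-term matrix onto 10 latent dimensions (LSA);
# doc_vectors has shape (n_docs, 10)
svd = TruncatedSVD(n_components=10, random_state=0)
doc_vectors = svd.fit_transform(dtm)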
# Let's look at some of the terms
vectorizer.terms_list[:20]
['$',
'2d',
'3d',
'A.',
'AEDPA',
'Act',
'Amendment',
'American',
'Ann',
'Ante',
'App',
'Appeals',
'B',
'Board',
'Breyer',
'Brief',
'Cert',
'Cf',
'Circuit',
'Citizens']
We can see that we are still getting a number of terms that might be filtered out, such as symbols and abbreviations. The most straightforward solutions are to filter the terms against a dictionary during vectorization, which carries the risk of inadvertently discarding words you’d prefer to keep in the dataset, or to curate a custom stopword list, which can be inflexible and time-consuming. Otherwise, the corpus analysis tools used with the vectorized texts (e.g., topic modeling or stylistic analysis; see below) often have ways of recognizing and sequestering unwanted terms so that they can be excluded from the results if desired.
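As a minimal sketch, a custom stopword pass over the tokenized corpus might look like the following; CUSTOM_STOPS is an illustrative set, not a vetted list:
# Drop citation abbreviations and stray symbols before re-vectorizing;
# the entries in CUSTOM_STOPS are illustrative, not exhaustive
CUSTOM_STOPS = {"$", "2d", "3d", "Ante", "Cf", "Ibid", "Id."}
filtered_corpus = [
    [tok for tok in doc_tokens if tok not in CUSTOM_STOPS]
    for doc_tokens in tokenized_corpus
]
filtered_dtm = vectorizer.fit_transform(filtered_corpus)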
Exercise - topic modeling#
Read through the code below for one example of what we can do with a vectorized corpus. Topic modeling is very popular for semantic exploration of texts, and there are numerous implementations of it; textacy uses the implementations from scikit-learn. Our corpus is rather small for topic modeling, but just to see how it’s done, we’ll go ahead. First, though: topic modeling works best when the texts are divided into approximately equal-sized “chunks.” A quick word count of the corpus shows that the decisions vary considerably in length, which would skew the topic model.
for doc in corpus:
    print(f"{len(doc): >5} {doc._.meta['case_name'][:80]}")
13007 HEMI GROUP, LLC AND KAI GACHUPIN v. CITY OF NEW YORK, NEW YORK
72325 CITIZENS UNITED v. FEDERAL ELECTION COMMISSION
8333 HOLLY WOOD, PETITIONER v. RICHARD F. ALLEN, COMMISSIONER, ALABAMA DEPARTMENT OF
9947 AGRON KUCANA v. ERIC H. HOLDER, JR., ATTORNEY GENERAL
5508 MARCUS A. WELLONS v. HILTON HALL, WARDEN
4083 ERIC PRESLEY v. GEORGIA
7609 NRG POWER MARKETING, LLC, et al. v. MAINE PUBLIC UTILITIES COMMISSION et al.
6983 E. K. MCDANIEL, WARDEN, et al. v. TROY BROWN
13844 MARYLAND v. MICHAEL BLAINE SHATZER, SR.
8945 THE HERTZ CORPORATION v. MELINDA FRIEND et al.
11605 FLORIDA v. KEVIN DEWAYNE POWELL
2343 RICK THALER, DIRECTOR, TEXAS DEPARTMENT OF CRIMINAL JUSTICE, CORRECTIONAL INSTIT
8531 MAC'S SHELL SERVICE, INC., et al. v. SHELL OIL PRODUCTS CO. LLC et al.
8259 REED ELSEVIER, INC., et al., v. IRVIN MUCHNICK et al.
9502 CURTIS DARNELL JOHNSON v. UNITED STATES
281 JAMAL KIYEMBA et al. v. BARACK H. OBAMA, PRESIDENT OF THE UNITED STATES et al.
12424 MILAVETZ, GALLOP & MILAVETZ, P. A., et al. v. UNITED STATES
13983 TAYLOR JAMES BLOATE v. UNITED STATES
31066 SHADY GROVE ORTHOPEDIC ASSOCIATES, P. A. v. ALLSTATE INSURANCE COMPANY
15135 JOSE PADILLA v. KENTUCKY
8236 JERRY N. JONES, et al. v. HARRIS ASSOCIATES L. P.
8634 MARY BERGHUIS, WARDEN v. DIAPOLIS SMITH
15018 GRAHAM COUNTY SOIL AND WATER CONSERVATION DISTRICT, et al. v. UNITED STATES ex r
8025 UNITED STUDENT AID FUNDS, INC. v. FRANCISCO J. ESPINOSA
5499 ESTHER HUI, et al. v. YANIRA CASTANEDA, AS PERSONAL REPRESENTATIVE OF THE ESTATE
14398 PAUL RENICO, WARDEN v. REGINALD LETT
27245 KEN L. SALAZAR, SECRETARY OF THE INTERIOR, et al. v. FRANK BUONO
16003 STOLT-NIELSEN S. A., et al. v. ANIMALFEEDS INTERNATIONAL CORP.
12015 MERCK & CO., INC., et al. v. RICHARD REYNOLDS et al.
25675 KAREN L. JERMAN v. CARLISLE, MCNELLIE, RINI, KRAMER & ULRICH LPA, et al.
12618 SONNY PERDUE, GOVERNOR OF GEORGIA, et al. v. KENNY A., BY HIS NEXT FRIEND LINDA
19636 UNITED STATES v. ROBERT J. STEVENS
19849 TIMOTHY MARK CAMERON ABBOTT v. JACQUELYN VAYE ABBOTT
31417 TERRANCE JAMAR GRAHAM v. FLORIDA
23340 UNITED STATES v. GRAYDON EARL COMSTOCK, JR., et al.
20 JOE HARRIS SULLIVAN v. FLORIDA
8980 AMERICAN NEEDLE, INC. v. NATIONAL FOOTBALL LEAGUE et al.
5179 ARTHUR L. LEWIS, JR., et al. v. CITY OF CHICAGO, ILLINOIS
9721 UNITED STATES v. MARTIN O'BRIEN AND ARTHUR BURGESS
6672 BRIDGET HARDT v. RELIANCE STANDARD LIFE INSURANCE COMPANY
5771 UNITED STATES v. GLENN MARCUS
4438 JOHN ROBERTSON v. UNITED STATES ex rel. WYKENNA WATSON
8687 LAWRENCE JOSEPH JEFFERSON v. STEPHEN UPTON, WARDEN
10645 MOHAMED ALI SAMANTAR v. BASHE ABDI YOUSUF et al.
17847 MARY BERGHUIS, WARDEN v. VAN CHESTER THOMPKINS
10183 RICHARD A. LEVIN, TAX COMMISSIONER OF OHIO v. COMMERCE ENERGY, INC., et al.
15010 THOMAS CARR v. UNITED STATES
12933 MICHAEL GARY BARBER, et al. v. J. E. THOMAS, WARDEN
13676 JAN HAMILTON, CHAPTER 13 TRUSTEE v. STEPHANIE KAY LANNING
8312 WANDA KRUPSKI v. COSTA CROCIERE S. P. A.
9987 JOSE ANGEL CARACHURI-ROSENDO v. ERIC H. HOLDER, JR., ATTORNEY GENERAL
8029 MICHAEL J. ASTRUE, COMMISSIONER OF SOCIAL SECURITY v. CATHERINE G. RATLIFF
10821 BRIAN RUSSELL DOLAN v. UNITED STATES
17847 ALBERT HOLLAND v. FLORIDA
11172 NEW PROCESS STEEL, L. P. v. NATIONAL LABOR RELATIONS BOARD
18022 STOP THE BEACH RENOURISHMENT, INC. v. FLORIDA DEPARTMENT OF ENVIRONMENTAL PROTEC
9320 CITY OF ONTARIO, CALIFORNIA, et al. v. JEFF QUON et al.
18128 WILLIAM G. SCHWAB v. NADEJDA REILLY
13554 PERCY DILLON v. UNITED STATES
25016 ERIC H. HOLDER, JR., ATTORNEY GENERAL, et al. v. HUMANITARIAN LAW PROJECT et al.
10471 RENT-A-CENTER, WEST, INC. v. ANTONIO JACKSON
20738 KAWASAKI KISEN KAISHA LTD. et al. v. REGAL-BELOIT CORP. et al.
18474 MONSANTO COMPANY, et al. v. GEERTSON SEED FARMS et al.
23994 JOHN DOE #1, et al. v. SAM REED, WASHINGTON SECRETARY OF STATE, et al.
16696 ROBERT MORRISON, et al. v. NATIONAL AUSTRALIA BANK LTD. et al.
14013 GRANITE ROCK COMPANY v. INTERNATIONAL BROTHERHOOD OF TEAMSTERS et al.
15751 BILLY JOE MAGWOOD v. TONY PATTERSON, WARDEN, et al.
48969 JEFFREY K. SKILLING v. UNITED STATES
4629 CONRAD M. BLACK, JOHN A. BOULTBEE, AND MARK S. KIPNIS v. UNITED STATES
39727 FREE ENTERPRISE FUND AND BECKSTEAD AND WATTS, LLP v. PUBLIC COMPANY ACCOUNTING O
29263 BERNARD L. BILSKI AND RAND A. WARSAW v. DAVID J. KAPPOS, UNDER SECRETARY OF COMM
33725 CHRISTIAN LEGAL SOCIETY CHAPTER OF THE UNIVERSITY OF CALIFORNIA, HASTINGS COLLEG
87587 OTIS MCDONALD, et al. v. CITY OF CHICAGO, ILLINOIS, et al.
8126 DEMARCUS ALI SEARS v. STEPHEN UPTON, WARDEN
2242 BILL K. WILSON, SUPERINTENDENT, INDIANA STATE PRISON, PETITIONER v. JOSEPH E. CO
8986 KEVIN ABBOTT, PETITIONER v. UNITED STATES
4244 LOS ANGELES COUNTY, CALIFORNIA, PETITIONER v. CRAIG ARTHUR HUMPHRIES et al.
29 COSTCO WHOLESALE CORPORATION, PETITIONER v. OMEGA, S.A.
10250 KEITH SMITH, WARDEN v. FRANK G. SPISAK, JR.
We’ll re-chunk the texts into documents of not more than 500 words and then recompute the document-term matrix.
import itertools

chunked_corpus_unflattened = [
    [text[x:x+500] for x in range(0, len(text), 500)] for text in tokenized_corpus
]
chunked_corpus = list(itertools.chain.from_iterable(chunked_corpus_unflattened))
chunked_dtm = vectorizer.fit_transform(chunked_corpus)
chunked_dtm
<1006x500 sparse matrix of type '<class 'numpy.int32'>'
with 91636 stored elements in Compressed Sparse Row format>
import textacy.tm

model = textacy.tm.TopicModel("lda", n_topics=15)
model.fit(chunked_dtm)
doc_topic_matrix = model.transform(chunked_dtm)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print(f"{topic_idx: >2} {model.topic_weights(doc_topic_matrix)[topic_idx]: >3.0%}", "|", ", ".join(top_terms))
0 8% | right, rights, Clause, States, Justice, state, Amendment, bear, Constitution, law
1 4% | fees, carrier, party, attorney, fee, award, services, filed, $, Rule
2 4% | rights, child, right, Convention, State, custody, Ann, States, Stat, A.
3 4% | debtor, income, value, felony, delay, time, claimed, Code, property, exempt
4 8% | sentence, sentencing, time, life, habeas, year, State, federal, years, application
5 5% | counsel, attorney, Miranda, state, suspect, interrogation, police, right, evidence, advice
6 1% | business, Director, Office, General, place, corporation, patent, Fed, method, State
7 4% | Footnote, arbitration, agreement, parties, contract, dispute, clause, Inc., question, Brief
8 6% | process, petition, disclosure, plaintiffs, challenge, test, applied, referendum, claim, State
9 11% | Amendment, speech, Hastings, J., public, political, interest, policy, Government, corporations
10 9% | jury, trial, Id., jurors, d., evidence, District, App, judge, reasonable
11 11% | Congress, Board, States, United, statute, power, Act, Commission, authority, foreign
12 8% | F., United, States, 3d, statute, Congress, law, error, criminal, conduct
13 12% | state, federal, law, courts, Act, action, class, Rule, claims, City
14 3% | cross, debt, District, injunction, relief, land, transfer, Government, bankruptcy, agency
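The TopicModel class can also report which chunks load most heavily on each topic via its top_topic_docs method; a short sketch (the indices refer to positions in chunked_corpus, not to whole decisions):
# For each of the first two topics, list the three most representative chunks
for topic_idx, top_doc_idxs in model.top_topic_docs(doc_topic_matrix, topics=[0, 1], top_n=3):
    print(topic_idx, top_doc_idxs)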
Document similarity with word2vec and clustering#
spaCy and textacy provide several built-in methods for measuring the degree of similarity between two documents, including a word2vec-based approach that computes the semantic similarity between documents based on the word vector model included with the spaCy language model. This technique is capable of inferring, for example, that two documents are topically related even if they don’t share any words but use synonyms for a shared concept.
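For a single pair of documents, spaCy exposes this directly as Doc.similarity, which returns the cosine similarity of the documents’ averaged word vectors. For example:
# Cosine similarity between the first two decisions
# (values closer to 1.0 indicate more semantically similar documents)
corpus[0].similarity(corpus[1])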
To evaluate this similarity comparison, we’ll compute the similarity of each pair of docs in the corpus, and then branch out into scikit-learn a bit to look for clusters based on these similarity measurements.
import numpy as np

dim = corpus.n_docs
distance_matrix = np.zeros((dim, dim))
for i, doc_i in enumerate(corpus):
    for j, doc_j in enumerate(corpus):
        if i == j:
            continue  # distance from a document to itself stays 0.0
        if i > j:
            distance_matrix[i, j] = distance_matrix[j, i]  # the matrix is symmetric
        else:
            distance_matrix[i, j] = 1 - doc_i.similarity(doc_j)
distance_matrix
distance_matrix
array([[0. , 0.00428384, 0.00913359, ..., 0.0036863 , 0.05862846,
0.00781636],
[0.00428384, 0. , 0.00910768, ..., 0.00630599, 0.05399509,
0.0058778 ],
[0.00913359, 0.00910768, 0. , ..., 0.00614726, 0.04253501,
0.00566709],
...,
[0.0036863 , 0.00630599, 0.00614726, ..., 0. , 0.05483739,
0.00724887],
[0.05862846, 0.05399509, 0.04253501, ..., 0.05483739, 0. ,
0.04504147],
[0.00781636, 0.0058778 , 0.00566709, ..., 0.00724887, 0.04504147,
0. ]])
With its default settings, the OPTICS density-based clustering algorithm finds only a couple of small clusters, but an examination of the legal issue types coded to each decision indicates that the word2vec-based similarities have indeed grouped semantically related documents: note, for instance, the all-Criminal Procedure cluster below.
from sklearn.cluster import OPTICS
clustering = OPTICS(metric='precomputed').fit(distance_matrix)
print(clustering.labels_)
[-1 0 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 0 -1 -1 -1
-1 1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1 -1 -1 -1 1 -1 -1 0 1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 0 -1 -1 1 -1 -1 -1 0
0 1 -1 -1 -1 -1 1]
from itertools import groupby

clusters = groupby(sorted(enumerate(clustering.labels_), key=lambda x: x[1]), lambda x: x[1])
for cluster_label, docs in clusters:
    if cluster_label == -1:
        continue  # -1 marks documents OPTICS left unclustered
    print(f"Cluster {cluster_label}", "\n---------")
    print("\n".join(
        f"{corpus[i]._.meta['us_cite_id']: <12} | {data.issue_area_codes[corpus[i]._.meta['issue_area']]: <18}"
        f" | {data.issue_codes[corpus[i]._.meta['issue']][:60]}"
        for i, _ in docs
    ))
    print("\n\n")
Cluster 0
---------
558 U.S. 310 | First Amendment | campaign spending (cf. governmental corruption):
559 U.S. 393 | Judicial Power | Federal Rules of Civil Procedure including Supreme Court Rul
559 U.S. 335 | Economic Activity | federal or state regulation of securities
559 U.S. 573 | Civil Rights | debtors' rights
560 U.S. 126 | Federalism | national supremacy: miscellaneous
560 U.S. 305 | Economic Activity | liability, other than as in sufficiency of evidence, electio
561 U.S. 1 | First Amendment | federal or state internal security legislation: Smith, Inter
561 U.S. 186 | Privacy | Freedom of Information Act and related federal or state stat
561 U.S. 247 | Economic Activity | federal or state regulation of securities
561 U.S. 661 | First Amendment | free exercise of religion
561 U.S. 742 | Criminal Procedure | miscellaneous criminal procedure (cf. due process, prisoners
Cluster 1
---------
558 U.S. 220 | Criminal Procedure | discovery and inspection (in the context of criminal litigat
559 U.S. 98 | Criminal Procedure | Miranda warnings
559 U.S. 766 | Criminal Procedure | habeas corpus
560 U.S. 258 | Criminal Procedure | Federal Rules of Criminal Procedure
560 U.S. 370 | Criminal Procedure | Miranda warnings
561 U.S. 358 | Criminal Procedure | statutory construction of criminal laws: fraud
561 U.S. 945 | Criminal Procedure | right to counsel (cf. indigents appointment of counsel or in
558 U.S. 139 | Criminal Procedure | cruel and unusual punishment, death penalty (cf. extra legal
clean
['assyrian',
'monarchs',
'especially',
'sardanapalus',
'babylon',
'scene',
'great',
'intellectual',
'activity',
'sardanapalus',
'assyrian',
'babylon',
'ized',
'library',
'library',
'paper',
'clay',
'tablets',
'writing',
'mesopotamia',
'early',
'sumerian',
'days',
'collection',
'unearthed',
'precious',
'store',
'historical',
'material',
'world',
'chaldean',
'line',
'babylonian',
'monarchs',
'nabonidus',
'keener',
'literary',
'tastes',
'patronized',
'antiquarian',
'researches',
'date',
'worked',
'investigators',
'accession',
'sargon',
'commemorated',
'fact',
'inscriptions',
'signs',
'disunion',
'empire',
'sought',
'centralize',
'bringing',
'number',
'local',
'gods',
'babylon',
'setting',
'temples',
'device',
'practised',
'successfully',
'romans',
'later',
'times',
'babylon',
'roused',
'jealousy',
'powerful',
'priesthood',
'bel',
'marduk',
'dominant',
'god',
'babylonians',
'cast',
'possible',
'alternative',
'nabonidus',
'found',
'cyrus',
'persian',
'ruler',
'adjacent',
'median',
'empire',
'cyrus',
'distinguished',
'conquering',
'croesus',
'rich',
'king',
'lydia',
'eastern',
'asia',
'minor',
'came',
'babylon',
'battle',
'outside',
'walls',
'gates',
'city',
'opened',
'soldiers',
'entered',
'city',
'fighting',
'crown',
'prince',
'belshazzar',
'son',
'nabonidus',
'feasting',
'bible',
'relates',
'hand',
'appeared',
'wrote',
'letters',
'fire',
'wall',
'mystical',
'words',
'mene',
'mene',
'tekel',
'upharsin',
'interpreted',
'prophet',
'daniel',
'summoned',
'read',
'riddle',
'god',
'numbered',
'thy',
'kingdom',
'finished',
'thou',
'art',
'weighed',
'balance',
'found',
'wanting',
'thy',
'kingdom',
'given',
'medes',
'persians',
'possibly',
'priests',
'bel',
'marduk',
'knew',
'writing',
'wall',
'belshazzar',
'killed',
'night',
'says',
'bible',
'nabonidus',
'taken',
'prisoner',
'occupation',
'city',
'peaceful',
'services',
'bel',
'marduk',
'continued',
'intermission']
Exercises#
Filter the tokens from the H. G. Wells text variable to 1) lowercase all text, 2) remove punctuation, 3) remove spaces and line breaks, 4) remove numbers, and 5) remove stopwords - all in one line! (See the sketch below.)
Read through the spaCy 101 guide and begin to apply its principles to your own corpus: https://spacy.io/usage/spacy-101
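A minimal sketch of one possible solution, assuming text holds the raw H. G. Wells passage and nlp is the loaded spaCy pipeline:
# One comprehension covers all five steps: lowercase each token, and drop
# punctuation, whitespace/line breaks, number-like tokens, and stopwords
clean = [tok.lower_ for tok in nlp(text) if not (tok.is_punct or tok.is_space or tok.like_num or tok.is_stop)]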
Topic modeling - going further#
There are many different approaches to modeling abstract topics in text data, such as top2vec and lda2vec.
Click ahead to see our coverage of the BERTopic algorithm in Chapter 10!