Chapter 8 - spaCy and textacy#
2023 April 28
These abridged materials are borrowed from the CIDR Workshop Text Analysis with Python
Why spaCy and textacy?#
The language processing features of spaCy and the corpus analysis methods of textacy together offer a wide range of functionality for text analysis in a well-maintained and well-documented software package that incorporates cutting-edge techniques as well as standard approaches.
The “C” in spaCy (and textacy) stands for Cython, which is Python that is compiled to C code and thus offers some performance advantages over interpreted Python, especially when working with large machine-learning models. The use of machine-learning models, including neural networks, is a key feature of spaCy and textacy. The authors of these libraries have also developed Prodigy, a similarly leading-edge but approachable tool for training custom machine-learning models for text analysis, among other uses.
Check out the spaCy 101 guide to learn more.
Topics#
Document Tokenization
Part-of-Speech (POS) Tagging
Named-Entity Recognition (NER)
Corpus Vectorization
Topic Modeling
Document Similarity
Stylistic Analysis
Note: The examples from this workshop use English texts, but all of the methods are applicable to other languages. The availability of specialized resources (parsing rules, dictionaries, trained models) can vary considerably by language, however.
A brief word about terms#
Text analysis involves the extraction of information from significant amounts of free-form text, e.g., literature (prose, poetry), historical records, long-form survey responses, legal documents. Some of the techniques used are also applicable to short-form text data, including documents that are already in tabular format.
Text analysis methods are built upon techniques for Natural Language Processing (NLP), which began as rule-based approaches to parsing human language and eventually incorporated statistical machine learning methods as well as, most recently, neural network/deep learning-based approaches.
Text mining typically refers to the extraction of information from very large corpora of unstructured texts.
# !pip install textacy
import spacy
import textacy
Document-level analysis with spaCy#
Let’s start by learning how spaCy works and using it to begin analyzing a single text document. We’ll work with larger corpora later in the workshop.
For this workshop we will work with a pre-trained statistical and deep-learning model provided by spaCy to process text. spaCy’s models are differentiated by language (21 languages are supported at present), capabilities, training text, and size. Smaller models are more efficient; larger models are more accurate. Here we’ll download and use a medium-sized English multi-task model, which supports part-of-speech tagging and named-entity recognition and includes a word-vector model.
!python -m spacy download en_core_web_md
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
# Once we've installed the model, we can import it like any other Python library
import en_core_web_md
# This instantiates a spaCy text processor based on the installed model
nlp = en_core_web_md.load()
# From H.G. Wells's A Short History of the World, Project Gutenberg
text = """Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity. {111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized. He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days. His collection has been unearthed and is
perhaps the most precious store of historical material in the
world. The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes. He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions. But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there. This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians. They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire. Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor. {112} He came up against Babylon, there was a battle
outside the walls, and the gates of the city were opened to him
(538 B.C.). His soldiers entered the city without fighting. The
crown prince Belshazzar, the son of Nabonidus, was feasting, the
Bible relates, when a hand appeared and wrote in letters of fire
upon the wall these mystical words: _"Mene, Mene, Tekel,
Upharsin,"_ which was interpreted by the prophet Daniel, whom he
summoned to read the riddle, as "God has numbered thy kingdom and
finished it; thou art weighed in the balance and found wanting and
thy kingdom is given to the Medes and Persians." Possibly the
priests of Bel Marduk knew something about that writing on the
wall. Belshazzar was killed that night, says the Bible.
Nabonidus was taken prisoner, and the occupation of the city was
so peaceful that the services of Bel Marduk continued without
intermission."""
By default, spaCy applies its entire NLP “pipeline” to the text as soon as it is provided to the model and outputs a processed “doc.”
doc = nlp(text)
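Before we dig into the results, it can be helpful to see which pipeline components actually ran; spaCy exposes them by name (for en_core_web_md this typically includes a tagger, parser, and named-entity recognizer, among others).
# Inspect the components that make up the loaded pipeline
print(nlp.pipe_names)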
Tokenization#
The doc created by spaCy immediately provides access to the word-level tokens of the text.
for token in doc[:15]:
    print(token)
Even
under
the
Assyrian
monarchs
and
especially
under
Sardanapalus
,
Babylon
had
been
a
Each of these tokens has a number of properties, and we’ll look a bit more closely at them in a minute.
spaCy also automatically provides sentence-level segmentation (sentencization).
import itertools
for sent in itertools.islice(doc.sents, 10):
    print(sent.text + "\n--\n")
Even under the Assyrian monarchs and especially under
Sardanapalus, Babylon had been a scene of great intellectual
activity.
--
{111} Sardanapalus, though an Assyrian, had been quite
Babylon-ized.
--
He made a library, a library not of paper but of
the clay tablets that were used for writing in Mesopotamia since
early Sumerian days.
--
His collection has been unearthed and is
perhaps the most precious store of historical material in the
world.
--
The last of the Chaldean line of Babylonian monarchs,
Nabonidus, had even keener literary tastes.
--
He patronized
antiquarian researches, and when a date was worked out by his
investigators for the accession of Sargon I he commemorated the
fact by inscriptions.
--
But there were many signs of disunion in
his empire, and he sought to centralize it by bringing a number of
the various local gods to Babylon and setting up temples to them
there.
--
This device was to be practised quite successfully by the
Romans in later times, but in Babylon it roused the jealousy of
the powerful priesthood of Bel Marduk, the dominant god of the
Babylonians.
--
They cast about for a possible alternative to
Nabonidus and found it in Cyrus the Persian, the ruler of the
adjacent Median Empire.
--
Cyrus had already distinguished himself
by conquering Croesus, the rich king of Lydia in Eastern Asia
Minor.
--
You’ll notice that the line breaks in the sample text are making the extracted sentences, and also the word-level tokens, a bit messy. The simplest way to avoid this is just to replace all single line breaks in the text with spaces before running it through the spaCy pipeline, i.e., as a preprocessing step.
There are other ways to handle this within the spaCy pipeline; an important feature of spaCy is that every phase of the built-in pipeline can be replaced by a custom module. One could imagine, for example, writing a replacement sentencizer that takes advantage of the presence of two spaces between all sentences in the sample text; a rough sketch of that idea follows, though we leave refining it as an exercise for the reader.
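This is only a sketch, under the assumption that the double-space convention holds throughout the text; the component name is our own invention, and we don’t actually add it to the pipeline used in this workshop.
from spacy.language import Language

@Language.component("double_space_sentencizer")
def double_space_sentencizer(doc):
    # Runs of extra whitespace in the raw text become space *tokens*;
    # in our sample these occur exactly at sentence boundaries.
    for token in doc[:-1]:
        if token.is_space:
            doc[token.i + 1].is_sent_start = True
    return doc

# To activate it, register it before the parser (sentence starts can't be
# changed once a doc has been parsed):
# nlp.add_pipe("double_space_sentencizer", before="parser")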
text_as_line = text.replace("\n", " ")
doc = nlp(text_as_line)
for sent in itertools.islice(doc.sents, 10):
    print(sent.text + "\n--\n")
Even under the Assyrian monarchs and especially under Sardanapalus, Babylon had been a scene of great intellectual activity.
--
{111} Sardanapalus, though an Assyrian, had been quite Babylon-ized.
--
He made a library, a library not of paper but of the clay tablets that were used for writing in Mesopotamia since early Sumerian days.
--
His collection has been unearthed and is perhaps the most precious store of historical material in the world.
--
The last of the Chaldean line of Babylonian monarchs, Nabonidus, had even keener literary tastes.
--
He patronized antiquarian researches, and when a date was worked out by his investigators for the accession of Sargon I he commemorated the fact by inscriptions.
--
But there were many signs of disunion in his empire, and he sought to centralize it by bringing a number of the various local gods to Babylon and setting up temples to them there.
--
This device was to be practised quite successfully by the Romans in later times, but in Babylon it roused the jealousy of the powerful priesthood of Bel Marduk, the dominant god of the Babylonians.
--
They cast about for a possible alternative to Nabonidus and found it in Cyrus the Persian, the ruler of the adjacent Median Empire.
--
Cyrus had already distinguished himself by conquering Croesus, the rich king of Lydia in Eastern Asia Minor.
--
We can collect both words and sentences into standard Python data structures (lists, in this case).
doc.sents
<generator at 0x7f8ecbf66ea0>
sentences = [sent.text for sent in doc.sents]
sentences
['Even under the Assyrian monarchs and especially under Sardanapalus, Babylon had been a scene of great intellectual activity.',
' {111} Sardanapalus, though an Assyrian, had been quite Babylon-ized.',
' He made a library, a library not of paper but of the clay tablets that were used for writing in Mesopotamia since early Sumerian days.',
' His collection has been unearthed and is perhaps the most precious store of historical material in the world.',
' The last of the Chaldean line of Babylonian monarchs, Nabonidus, had even keener literary tastes.',
' He patronized antiquarian researches, and when a date was worked out by his investigators for the accession of Sargon I he commemorated the fact by inscriptions.',
' But there were many signs of disunion in his empire, and he sought to centralize it by bringing a number of the various local gods to Babylon and setting up temples to them there.',
' This device was to be practised quite successfully by the Romans in later times, but in Babylon it roused the jealousy of the powerful priesthood of Bel Marduk, the dominant god of the Babylonians.',
' They cast about for a possible alternative to Nabonidus and found it in Cyrus the Persian, the ruler of the adjacent Median Empire.',
' Cyrus had already distinguished himself by conquering Croesus, the rich king of Lydia in Eastern Asia Minor.',
' {112} He came up against Babylon, there was a battle outside the walls, and the gates of the city were opened to him (538 B.C.).',
' His soldiers entered the city without fighting.',
' The crown prince Belshazzar, the son of Nabonidus, was feasting, the Bible relates, when a hand appeared and wrote in letters of fire upon the wall these mystical words: _"Mene, Mene, Tekel, Upharsin,"_ which was interpreted by the prophet Daniel, whom he summoned to read the riddle, as "God has numbered thy kingdom and finished it; thou art weighed in the balance and found wanting and thy kingdom is given to the Medes and Persians."',
' Possibly the priests of Bel Marduk knew something about that writing on the wall.',
' Belshazzar was killed that night, says the Bible.',
'Nabonidus was taken prisoner, and the occupation of the city was so peaceful that the services of Bel Marduk continued without intermission.']
words = [token.text for token in doc]
words
['Even',
'under',
'the',
'Assyrian',
'monarchs',
'and',
'especially',
'under',
'Sardanapalus',
',',
'Babylon',
'had',
'been',
'a',
'scene',
'of',
'great',
'intellectual',
'activity',
'.',
' ',
'{',
'111',
'}',
'Sardanapalus',
',',
'though',
'an',
'Assyrian',
',',
'had',
'been',
'quite',
'Babylon',
'-',
'ized',
'.',
' ',
'He',
'made',
'a',
'library',
',',
'a',
'library',
'not',
'of',
'paper',
'but',
'of',
'the',
'clay',
'tablets',
'that',
'were',
'used',
'for',
'writing',
'in',
'Mesopotamia',
'since',
'early',
'Sumerian',
'days',
'.',
' ',
'His',
'collection',
'has',
'been',
'unearthed',
'and',
'is',
'perhaps',
'the',
'most',
'precious',
'store',
'of',
'historical',
'material',
'in',
'the',
'world',
'.',
' ',
'The',
'last',
'of',
'the',
'Chaldean',
'line',
'of',
'Babylonian',
'monarchs',
',',
'Nabonidus',
',',
'had',
'even',
'keener',
'literary',
'tastes',
'.',
' ',
'He',
'patronized',
'antiquarian',
'researches',
',',
'and',
'when',
'a',
'date',
'was',
'worked',
'out',
'by',
'his',
'investigators',
'for',
'the',
'accession',
'of',
'Sargon',
'I',
'he',
'commemorated',
'the',
'fact',
'by',
'inscriptions',
'.',
' ',
'But',
'there',
'were',
'many',
'signs',
'of',
'disunion',
'in',
'his',
'empire',
',',
'and',
'he',
'sought',
'to',
'centralize',
'it',
'by',
'bringing',
'a',
'number',
'of',
'the',
'various',
'local',
'gods',
'to',
'Babylon',
'and',
'setting',
'up',
'temples',
'to',
'them',
'there',
'.',
' ',
'This',
'device',
'was',
'to',
'be',
'practised',
'quite',
'successfully',
'by',
'the',
'Romans',
'in',
'later',
'times',
',',
'but',
'in',
'Babylon',
'it',
'roused',
'the',
'jealousy',
'of',
'the',
'powerful',
'priesthood',
'of',
'Bel',
'Marduk',
',',
'the',
'dominant',
'god',
'of',
'the',
'Babylonians',
'.',
' ',
'They',
'cast',
'about',
'for',
'a',
'possible',
'alternative',
'to',
'Nabonidus',
'and',
'found',
'it',
'in',
'Cyrus',
'the',
'Persian',
',',
'the',
'ruler',
'of',
'the',
'adjacent',
'Median',
'Empire',
'.',
' ',
'Cyrus',
'had',
'already',
'distinguished',
'himself',
'by',
'conquering',
'Croesus',
',',
'the',
'rich',
'king',
'of',
'Lydia',
'in',
'Eastern',
'Asia',
'Minor',
'.',
' ',
'{',
'112',
'}',
'He',
'came',
'up',
'against',
'Babylon',
',',
'there',
'was',
'a',
'battle',
'outside',
'the',
'walls',
',',
'and',
'the',
'gates',
'of',
'the',
'city',
'were',
'opened',
'to',
'him',
'(',
'538',
'B.C.',
')',
'.',
' ',
'His',
'soldiers',
'entered',
'the',
'city',
'without',
'fighting',
'.',
' ',
'The',
'crown',
'prince',
'Belshazzar',
',',
'the',
'son',
'of',
'Nabonidus',
',',
'was',
'feasting',
',',
'the',
'Bible',
'relates',
',',
'when',
'a',
'hand',
'appeared',
'and',
'wrote',
'in',
'letters',
'of',
'fire',
'upon',
'the',
'wall',
'these',
'mystical',
'words',
':',
'_',
'"',
'Mene',
',',
'Mene',
',',
'Tekel',
',',
'Upharsin',
',',
'"',
'_',
'which',
'was',
'interpreted',
'by',
'the',
'prophet',
'Daniel',
',',
'whom',
'he',
'summoned',
'to',
'read',
'the',
'riddle',
',',
'as',
'"',
'God',
'has',
'numbered',
'thy',
'kingdom',
'and',
'finished',
'it',
';',
'thou',
'art',
'weighed',
'in',
'the',
'balance',
'and',
'found',
'wanting',
'and',
'thy',
'kingdom',
'is',
'given',
'to',
'the',
'Medes',
'and',
'Persians',
'.',
'"',
' ',
'Possibly',
'the',
'priests',
'of',
'Bel',
'Marduk',
'knew',
'something',
'about',
'that',
'writing',
'on',
'the',
'wall',
'.',
' ',
'Belshazzar',
'was',
'killed',
'that',
'night',
',',
'says',
'the',
'Bible',
'.',
'Nabonidus',
'was',
'taken',
'prisoner',
',',
'and',
'the',
'occupation',
'of',
'the',
'city',
'was',
'so',
'peaceful',
'that',
'the',
'services',
'of',
'Bel',
'Marduk',
'continued',
'without',
'intermission',
'.']
Filtering tokens#
After extracting the tokens, we can use some attributes and methods provided by spaCy, along with some vanilla Python methods, to filter the tokens to just the types we’re interested in analyzing.
# If we're only interested in analyzing word tokens, we can remove punctuation:
for token in doc[:20]:
    print(f'TOKEN: {token.text:15} IS_PUNCTUATION: {token.is_punct}')

no_punct = [token for token in doc if not token.is_punct]
no_punct[:20]
TOKEN: Even IS_PUNCTUATION: False
TOKEN: under IS_PUNCTUATION: False
TOKEN: the IS_PUNCTUATION: False
TOKEN: Assyrian IS_PUNCTUATION: False
TOKEN: monarchs IS_PUNCTUATION: False
TOKEN: and IS_PUNCTUATION: False
TOKEN: especially IS_PUNCTUATION: False
TOKEN: under IS_PUNCTUATION: False
TOKEN: Sardanapalus IS_PUNCTUATION: False
TOKEN: , IS_PUNCTUATION: True
TOKEN: Babylon IS_PUNCTUATION: False
TOKEN: had IS_PUNCTUATION: False
TOKEN: been IS_PUNCTUATION: False
TOKEN: a IS_PUNCTUATION: False
TOKEN: scene IS_PUNCTUATION: False
TOKEN: of IS_PUNCTUATION: False
TOKEN: great IS_PUNCTUATION: False
TOKEN: intellectual IS_PUNCTUATION: False
TOKEN: activity IS_PUNCTUATION: False
TOKEN: . IS_PUNCTUATION: True
[Even,
under,
the,
Assyrian,
monarchs,
and,
especially,
under,
Sardanapalus,
Babylon,
had,
been,
a,
scene,
of,
great,
intellectual,
activity,
,
111]
# There are still some space tokens; here's how to remove spaces and newlines:
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]
for token in no_punct_or_space[:30]:
    print(token.text)
Even
under
the
Assyrian
monarchs
and
especially
under
Sardanapalus
Babylon
had
been
a
scene
of
great
intellectual
activity
111
Sardanapalus
though
an
Assyrian
had
been
quite
Babylon
ized
He
made
# Let's say we also want to remove numbers and lowercase everything that remains
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]
lower_alpha[:30]
['even',
'under',
'the',
'assyrian',
'monarchs',
'and',
'especially',
'under',
'sardanapalus',
'babylon',
'had',
'been',
'a',
'scene',
'of',
'great',
'intellectual',
'activity',
'sardanapalus',
'though',
'an',
'assyrian',
'had',
'been',
'quite',
'babylon',
'ized',
'he',
'made',
'a']
One additional common filtering step is to remove stopwords. In theory, stopwords can be any words we’re not interested in analyzing, but in practice, they are often the most common words in a language that do not carry much semantic information (e.g., articles, conjunctions).
clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]
['assyrian',
'monarchs',
'especially',
'sardanapalus',
'babylon',
'scene',
'great',
'intellectual',
'activity',
'sardanapalus',
'assyrian',
'babylon',
'ized',
'library',
'library',
'paper',
'clay',
'tablets',
'writing',
'mesopotamia',
'early',
'sumerian',
'days',
'collection',
'unearthed',
'precious',
'store',
'historical',
'material',
'world']
We’ve used spaCy’s built-in stopword list; membership in this list determines the property is_stop for each token. It’s good practice to be wary of any built-in stopword list, however: there’s a good chance you will want to filter out some words that aren’t on the list and to keep some that are, especially if you’re working with specialized texts.
# We'll just pick a couple of words we know are in the example
custom_stopwords = ["assyrian", "babylon"]
custom_clean = [token for token in clean if token not in custom_stopwords]
custom_clean
['monarchs',
'especially',
'sardanapalus',
'scene',
'great',
'intellectual',
'activity',
'sardanapalus',
'ized',
'library',
'library',
'paper',
'clay',
'tablets',
'writing',
'mesopotamia',
'early',
'sumerian',
'days',
'collection',
'unearthed',
'precious',
'store',
'historical',
'material',
'world',
'chaldean',
'line',
'babylonian',
'monarchs',
'nabonidus',
'keener',
'literary',
'tastes',
'patronized',
'antiquarian',
'researches',
'date',
'worked',
'investigators',
'accession',
'sargon',
'commemorated',
'fact',
'inscriptions',
'signs',
'disunion',
'empire',
'sought',
'centralize',
'bringing',
'number',
'local',
'gods',
'setting',
'temples',
'device',
'practised',
'successfully',
'romans',
'later',
'times',
'roused',
'jealousy',
'powerful',
'priesthood',
'bel',
'marduk',
'dominant',
'god',
'babylonians',
'cast',
'possible',
'alternative',
'nabonidus',
'found',
'cyrus',
'persian',
'ruler',
'adjacent',
'median',
'empire',
'cyrus',
'distinguished',
'conquering',
'croesus',
'rich',
'king',
'lydia',
'eastern',
'asia',
'minor',
'came',
'battle',
'outside',
'walls',
'gates',
'city',
'opened',
'soldiers',
'entered',
'city',
'fighting',
'crown',
'prince',
'belshazzar',
'son',
'nabonidus',
'feasting',
'bible',
'relates',
'hand',
'appeared',
'wrote',
'letters',
'fire',
'wall',
'mystical',
'words',
'mene',
'mene',
'tekel',
'upharsin',
'interpreted',
'prophet',
'daniel',
'summoned',
'read',
'riddle',
'god',
'numbered',
'thy',
'kingdom',
'finished',
'thou',
'art',
'weighed',
'balance',
'found',
'wanting',
'thy',
'kingdom',
'given',
'medes',
'persians',
'possibly',
'priests',
'bel',
'marduk',
'knew',
'writing',
'wall',
'belshazzar',
'killed',
'night',
'says',
'bible',
'nabonidus',
'taken',
'prisoner',
'occupation',
'city',
'peaceful',
'services',
'bel',
'marduk',
'continued',
'intermission']
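As an aside, spaCy’s built-in stopword list can itself be edited in place, so that is_stop reflects your choices during tokenization. Here is a minimal sketch; the particular words are purely illustrative.
# Add a word to spaCy's stopword list and update its cached lexeme flag
nlp.Defaults.stop_words.add("ized")
nlp.vocab["ized"].is_stop = True

# Remove a word from the list (a no-op if it isn't there)
nlp.Defaults.stop_words.discard("various")
nlp.vocab["various"].is_stop = False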
At this point, we have a list of lower-cased tokens that doesn’t contain punctuation, white-space, numbers, or stopwords. Depending on your analytical goals, you may or may not want to do this much cleaning, but hopefully you have a greater appreciation for the kinds of cleaning that can be done with spaCy.
Counting tokens#
Now that we’ve used spaCy to tokenize and clean our text, we can begin one of the most fundamental text analysis tasks: counting words!
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))
Number of tokens in document: 442
Number of tokens in cleaned document: 175
Number of unique tokens in cleaned document: 147
from collections import Counter
?Counter
full_counter = Counter([token.lower_ for token in doc])
full_counter.most_common(20)
[('the', 36),
(',', 26),
('of', 20),
('.', 16),
(' ', 14),
('and', 13),
('in', 9),
('a', 8),
('was', 8),
('to', 8),
('he', 6),
('by', 6),
('babylon', 5),
('had', 4),
('that', 4),
('his', 4),
('nabonidus', 4),
('it', 4),
('"', 4),
('been', 3)]
cleaned_counter = Counter(clean)
cleaned_counter.most_common(20)
[('babylon', 5),
('nabonidus', 4),
('bel', 3),
('marduk', 3),
('city', 3),
('assyrian', 2),
('monarchs', 2),
('sardanapalus', 2),
('library', 2),
('writing', 2),
('empire', 2),
('god', 2),
('found', 2),
('cyrus', 2),
('belshazzar', 2),
('bible', 2),
('wall', 2),
('mene', 2),
('thy', 2),
('kingdom', 2)]
Part-of-speech tagging#
Let’s consider some other aspects of the text that spaCy exposes for us. One of the most noteworthy features is part-of-speech tagging.
# spaCy provides two levels of POS tagging. Here's the more general level.
for token in doc[:30]:
    print(token.text, token.pos_)
Even ADV
under ADP
the DET
Assyrian ADJ
monarchs NOUN
and CCONJ
especially ADV
under ADP
Sardanapalus PROPN
, PUNCT
Babylon PROPN
had AUX
been AUX
a DET
scene NOUN
of ADP
great ADJ
intellectual ADJ
activity NOUN
. PUNCT
SPACE
{ PUNCT
111 NUM
} PUNCT
Sardanapalus PROPN
, PUNCT
though SCONJ
an DET
Assyrian PROPN
, PUNCT
# spaCy also provides the more specific Penn Treebank tags.
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
for token in doc[:30]:
    print(token.text, token.tag_)
Even RB
under IN
the DT
Assyrian JJ
monarchs NNS
and CC
especially RB
under IN
Sardanapalus NNP
, ,
Babylon NNP
had VBD
been VBN
a DT
scene NN
of IN
great JJ
intellectual JJ
activity NN
. .
_SP
{ -LRB-
111 CD
} -RRB-
Sardanapalus NNP
, ,
though IN
an DT
Assyrian NNP
, ,
We can count the occurrences of each part of speech in the text, which may be useful for document classification (fiction may have different proportions of parts of speech relative to nonfiction, for example) or stylistic analysis (more on that later).
nouns = [token for token in doc if token.pos_ == "NOUN"]
verbs = [token for token in doc if token.pos_ == "VERB"]
proper_nouns = [token for token in doc if token.pos_ == "PROPN"]
adjectives = [token for token in doc if token.pos_ == "ADJ"]
adverbs = [token for token in doc if token.pos_ == "ADV"]
pos_counts = {
    "nouns": len(nouns),
    "verbs": len(verbs),
    "proper_nouns": len(proper_nouns),
    "adjectives": len(adjectives),
    "adverbs": len(adverbs),
}
pos_counts
{'nouns': 66, 'verbs': 43, 'proper_nouns': 45, 'adjectives': 24, 'adverbs': 12}
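The same tallies, plus those for every other tag, can be collected in a single pass with a Counter:
from collections import Counter

# Count every coarse-grained POS tag in one pass over the doc
Counter(token.pos_ for token in doc).most_common()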
spaCy performs morphosyntactic analysis of individual tokens, including lemmatizing inflected or conjugated forms to their base (dictionary) forms. Reducing words to their lemmas can make a large corpus more manageable, and it is generally more effective than stemming (trimming the inflected/conjugated endings of words until just the base portion remains). It should only be done, however, if the inflections are not relevant to your analysis.
for token in doc:
    if token.pos_ in ["NOUN", "VERB"] and token.orth_ != token.lemma_:
        print(f"{token.text:15} {token.lemma_}")
monarchs monarch
ized ize
made make
tablets tablet
used use
writing write
days day
unearthed unearth
monarchs monarch
had have
tastes taste
patronized patronize
researches research
worked work
investigators investigator
commemorated commemorate
inscriptions inscription
were be
signs sign
sought seek
bringing bring
gods god
setting set
temples temple
practised practise
times time
roused rouse
found find
distinguished distinguish
conquering conquer
came come
was be
walls wall
gates gate
opened open
soldiers soldier
entered enter
fighting fight
feasting feast
relates relate
appeared appear
wrote write
letters letter
words word
interpreted interpret
summoned summon
numbered number
finished finish
weighed weigh
found find
wanting want
given give
priests priest
knew know
killed kill
says say
taken take
services service
continued continue
Parsing#
spaCy’s trained models also provide full dependency parsing, tagging word tokens with their syntactic relations to other tokens. This functionality drives spaCy’s built-in sentencization as well.
We won’t spend much time exploring this feature, but it’s useful to see how it enables the extraction of multi-word “noun chunks” from the text. Note also that textacy (discussed below) has a built-in function to extract subject-verb-object triples from sentences.
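For a quick peek at what the parser produces, each token carries a dependency label and a reference to its syntactic head:
# Dependency relation and syntactic head for the first few tokens
for token in doc[:10]:
    print(f"{token.text:15} {token.dep_:10} {token.head.text}")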
for chunk in itertools.islice(doc.noun_chunks, 20):
    print(chunk.text)
the Assyrian monarchs
Sardanapalus
Babylon
a scene
great intellectual activity
{111} Sardanapalus
an Assyrian
He
a library
a library
paper
the clay tablets
that
Mesopotamia
early Sumerian days
His collection
the most precious store
historical material
the world
The last
Named-entity recognition#
spaCy’s models do a pretty good job of identifying and classifying named entities (people, places, organizations).
It is also fairly easy to customize and fine-tune these models by providing additional training data (e.g., texts with entities labeled according to the desired scheme), but that’s out of the scope of this workshop.
for ent in doc.ents:
    print(f'{ent.text:20} {ent.label_:15} {spacy.explain(ent.label_)}')
Assyrian NORP Nationalities or religious or political groups
Sardanapalus WORK_OF_ART Titles of books, songs, etc.
Babylon GPE Countries, cities, states
111 CARDINAL Numerals that do not fall under another type
Assyrian NORP Nationalities or religious or political groups
Babylon ORG Companies, agencies, institutions, etc.
Mesopotamia LOC Non-GPE locations, mountain ranges, bodies of water
early Sumerian days DATE Absolute or relative dates or periods
Chaldean NORP Nationalities or religious or political groups
Babylonian NORP Nationalities or religious or political groups
Nabonidus ORG Companies, agencies, institutions, etc.
Sargon ORG Companies, agencies, institutions, etc.
Romans NORP Nationalities or religious or political groups
Babylon GPE Countries, cities, states
Bel Marduk PERSON People, including fictional
Babylonians NORP Nationalities or religious or political groups
Nabonidus ORG Companies, agencies, institutions, etc.
Persian NORP Nationalities or religious or political groups
Croesus PERSON People, including fictional
Lydia PERSON People, including fictional
Eastern Asia Minor LOC Non-GPE locations, mountain ranges, bodies of water
112 CARDINAL Numerals that do not fall under another type
Babylon ORG Companies, agencies, institutions, etc.
538 CARDINAL Numerals that do not fall under another type
B.C. GPE Countries, cities, states
Nabonidus PERSON People, including fictional
Bible WORK_OF_ART Titles of books, songs, etc.
Mene PERSON People, including fictional
Tekel ORG Companies, agencies, institutions, etc.
Upharsin PERSON People, including fictional
Medes NORP Nationalities or religious or political groups
Persians NORP Nationalities or religious or political groups
Bel Marduk PERSON People, including fictional
that night TIME Times smaller than a day
Bible WORK_OF_ART Titles of books, songs, etc.
Nabonidus PERSON People, including fictional
Bel Marduk PERSON People, including fictional
What if we only care about geo-political entities or locations?
ent_filtered = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ["GPE", "LOC"]]
ent_filtered
[('Babylon', 'GPE'),
('Mesopotamia', 'LOC'),
('Babylon', 'GPE'),
('Eastern Asia Minor', 'LOC'),
('B.C.', 'GPE')]
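If you want frequencies rather than a flat list, a Counter over the filtered entities does the trick:
from collections import Counter

# Tally how often each place name appears in the document
Counter(ent.text for ent in doc.ents if ent.label_ in ["GPE", "LOC"]).most_common()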
Visualizing Parses#
The built-in displaCy visualizer can render the results of the named-entity recognition, as well as the dependency parser.
from spacy import displacy

?displacy.render
displacy.render(doc, style="dep", jupyter=True)
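The entity view works the same way; just switch the style:
# Highlight named entities inline instead of drawing dependency arcs
displacy.render(doc, style="ent", jupyter=True)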
Corpus-level analysis with textacy#
Let’s shift to thinking about a whole corpus rather than a single document. We could analyze multiple documents with spaCy and then knit the results together with some extra Python. Instead, though, we’re going to take advantage of textacy, a library built on spaCy that adds corpus analysis features.
For reference, here’s the online documentation for textacy.
Generating corpora#
We’ll use some of the data that is included in textacy as our corpus. It is certainly possible to build your own corpus by importing data from files in plain text, XML, JSON, CSV or other formats, but working with one of textacy’s “pre-cooked” datasets simplifies things a bit.
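For reference, here is a minimal sketch of building a corpus from your own plain-text files; the directory name is hypothetical, and textacy.Corpus is introduced just below.
import pathlib

# Hypothetical folder of plain-text files, one document per file
texts = (p.read_text() for p in pathlib.Path("my_texts").glob("*.txt"))
my_corpus = textacy.Corpus(nlp, data=texts)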
import textacy.datasets
# We'll work with a dataset of ~8,400 ("almost all") U.S. Supreme Court
# decisions from November 1946 through June 2016
# https://github.com/bdewilde/textacy-data/releases/tag/supreme_court_py3_v1.0
data = textacy.datasets.SupremeCourt()
data.download()
The documentation indicates the metadata that is available with each text.
# help(textacy.datasets.supreme_court)
textacy is organized around the concept of a corpus, whereas spaCy focuses on single documents. A textacy corpus is instantiated with a spaCy language model (we’re using the one from the first half of this workshop), which is used to apply its analytical pipeline to each text in the corpus; the corpus is then populated with a set of records, i.e., texts along with their metadata (if metadata is available).
Let’s go ahead and define a set of records (texts with metadata) that we’ll then add to our corpus. To keep the processing time of the data set a bit more manageable, we’ll just look at a set of court decisions from a short span of time.
from IPython.display import display, HTML, clear_output
corpus = textacy.Corpus(nlp)
# There are 79 docs in this range -- they'll take a minute or two to process
recent_decisions = data.records(date_range=('2010-01-01', '2010-12-31'))
for i, record in enumerate(recent_decisions):
    clear_output(wait=True)
    display(HTML(f"<pre>{i+1:>2}/79: Adding {record[1]['case_name']}</pre>"))
    corpus.add_record(record)
# If the three lines above are taking too long to process all 79 docs,
# comment them out and uncomment the two lines below to download and import
# a preprocessed version of the corpus
#!wget https://github.com/sul-cidr/Workshops/raw/master/Text_Analysis_with_Python/data/scotus_2010.bin.gz
#corpus = textacy.Corpus.load(nlp, "scotus_2010.bin.gz")
print(len(corpus))
[doc._.preview for doc in corpus[:5]]
79
['Doc(13007 tokens: "Respondent New York City taxes the possession o...")',
'Doc(72325 tokens: "As amended by §203 of the Bipartisan Campaign R...")',
'Doc(8333 tokens: "Under 28 U. S. C. §2254(d)(2), a federal court ...")',
'Doc(9947 tokens: "The Illegal Immigration Reform and Immigrant Re...")',
'Doc(5508 tokens: "Per Curiam. From beginning to end, judicial pr...")']
We can see that each item in the corpus is a Doc: a processed spaCy output document, with all of the extracted features. textacy provides some capacity to work with those features via its API, and also exposes new document-level features, such as ngrams and algorithms to determine a document’s readability level, among others.
We can filter this corpus based on metadata attributes.
corpus[9]._.meta
{'issue': '90520',
'issue_area': 9,
'n_min_votes': 0,
'case_name': 'THE HERTZ CORPORATION v. MELINDA FRIEND et al.',
'maj_opinion_author': 110,
'decision_date': '2010-02-23',
'decision_direction': 'liberal',
'n_maj_votes': 9,
'us_cite_id': '559 U.S. 77',
'argument_date': '2009-11-10'}
# Here we'll find all the cases where the number of justices voting in the majority was greater than 6.
supermajorities = [doc for doc in corpus.get(lambda doc: doc._.meta["n_maj_votes"] > 6)]
len(supermajorities)
53
supermajorities[0]._.preview
'Doc(8333 tokens: "Under 28 U. S. C. §2254(d)(2), a federal court ...")'
Finding important words in the corpus#
print("number of documents: ", corpus.n_docs)
print("number of sentences: ", corpus.n_sents)
print("number of tokens: ", corpus.n_tokens)
number of documents: 79
number of sentences: 42470
number of tokens: 1189205
?corpus.word_counts
corpus.word_counts(by="orth_", filter_stops=False, filter_punct=False, filter_nums=False)
{'Respondent': 250,
'New': 413,
'York': 284,
'City': 340,
'taxes': 56,
'the': 56729,
'possession': 216,
'of': 30659,
'cigarettes': 23,
'.': 44541,
'Petitioner': 246,
'Hemi': 82,
'Group': 37,
',': 72059,
'based': 526,
'in': 14744,
'Mexico': 16,
'sells': 6,
'online': 49,
'to': 25427,
'residents': 25,
'Neither': 74,
'state': 1920,
'nor': 242,
'city': 48,
'law': 1899,
'requires': 352,
'out': 548,
'-': 10949,
'sellers': 13,
'such': 1571,
'as': 5683,
'charge': 102,
'collect': 35,
'or': 5809,
'remit': 7,
"'s": 9728,
'tax': 175,
';': 4864,
'instead': 175,
'must': 1197,
'recover': 50,
'its': 2590,
'on': 5214,
'sales': 48,
'directly': 134,
'from': 2842,
'purchasers': 16,
'But': 1066,
'Jenkins': 51,
'Act': 1278,
'15': 449,
'U.': 5900,
'S.': 6072,
'C.': 1321,
'§': 6375,
'375': 37,
'378': 36,
'submit': 57,
'customer': 9,
'information': 310,
'States': 2256,
'into': 572,
'which': 2646,
'they': 1302,
'ship': 22,
'and': 15113,
'State': 1262,
'has': 2191,
'agreed': 176,
'forward': 51,
'that': 18373,
'That': 507,
'helps': 26,
'track': 19,
'down': 115,
'cigarette': 14,
'who': 1201,
'do': 1015,
'not': 8499,
'pay': 124,
'their': 1507,
'Against': 27,
'backdrop': 10,
'filed': 402,
'this': 3792,
'lawsuit': 51,
'under': 2030,
'Racketeer': 6,
'Influenced': 6,
'Corrupt': 6,
'Organizations': 11,
'(': 14766,
'RICO': 99,
')': 18634,
'alleging': 33,
'failure': 216,
'file': 195,
'reports': 84,
'with': 3669,
'constituted': 41,
'mail': 76,
'wire': 37,
'fraud': 328,
'are': 2675,
'defined': 144,
'"': 30807,
'racketeering': 5,
'activit[ies': 1,
']': 3140,
'18': 499,
'1961(1': 4,
'subject': 502,
'enforcement': 148,
'civil': 198,
'1964(c': 12,
'The': 5952,
'District': 1052,
'Court': 5729,
'dismissed': 61,
'claims': 611,
'but': 1669,
'Second': 481,
'Circuit': 695,
'vacated': 47,
'judgment': 1061,
'remanded': 129,
'Among': 37,
'other': 1491,
'things': 81,
'Appeals': 794,
'held': 625,
'asserted': 113,
'injury': 174,
'—': 1921,
'lost': 60,
'revenue': 51,
'came': 32,
'about': 767,
'by': 4551,
'reason': 416,
'predicate': 75,
'frauds': 23,
'It': 1106,
'accordingly': 30,
'determined': 150,
'had': 1656,
'stated': 232,
'a': 18294,
'valid': 136,
'claim': 938,
'Held': 68,
':': 1401,
'is': 9354,
'reversed': 181,
'case': 1985,
'541': 78,
'F.': 1407,
'3d': 1008,
'425': 32,
'Chief': 146,
'Justice': 872,
'Roberts': 124,
'delivered': 157,
'opinion': 1212,
'part': 695,
'concluding': 116,
'because': 1303,
'can': 1523,
'show': 217,
'it': 4838,
'alleged': 202,
'violation': 393,
'Pp': 393,
'5': 623,
'To': 306,
'establish': 166,
'an': 4200,
'plaintiff': 258,
'offense': 354,
'only': 1770,
'was': 3521,
"'": 4309,
'for': 7978,
'cause': 375,
'his': 1845,
'proximate': 26,
'well': 492,
'Holmes': 34,
'v.': 4345,
'Securities': 57,
'Investor': 6,
'Protection': 60,
'Corporation': 43,
'503': 63,
'258': 22,
'268': 37,
'Proximate': 2,
'purposes': 294,
'should': 1012,
'be': 4659,
'evaluated': 12,
'light': 286,
'common': 264,
'foundations': 7,
'thus': 437,
'some': 736,
'direct': 161,
'relation': 65,
'between': 624,
'injurious': 5,
'conduct': 686,
'Ibid': 461,
'A': 857,
'link': 22,
'too': 179,
'remote': 23,
'purely': 34,
'contingent': 12,
'indirec[t': 2,
'insufficient': 68,
'Id.': 1254,
'at': 9800,
'271': 14,
'274': 39,
'causal': 12,
'theory': 205,
'satisfy': 106,
'relationship': 136,
'requirement': 376,
'Indeed': 196,
'here': 652,
'far': 204,
'more': 1214,
'attenuated': 12,
'than': 1270,
'one': 1262,
'rejected': 248,
'According': 80,
'committed': 136,
'selling': 17,
'failing': 51,
'required': 397,
'Without': 38,
'could': 1033,
'pass': 41,
'even': 910,
'if': 1636,
'been': 1138,
'so': 1158,
'inclined': 8,
'Some': 67,
'customers': 53,
'legally': 61,
'obligated': 10,
'failed': 197,
'Because': 307,
'did': 1303,
'receive': 130,
'determine': 326,
'pursue': 78,
'those': 1158,
'payment': 52,
'thereby': 85,
'injured': 30,
'amount': 258,
'portion': 78,
'back': 130,
'were': 1191,
'never': 247,
'collected': 7,
'As': 716,
'reiterated': 20,
'[': 3078,
't]he': 180,
'general': 417,
'tendency': 12,
'regard': 62,
'damages': 174,
'least': 289,
'go': 94,
'beyond': 230,
'first': 718,
'step': 94,
'i': 1008,
'd.': 975,
'272': 29,
'applies': 365,
'full': 243,
'force': 233,
'inquiries': 17,
'e.g.': 879,
'ibid': 207,
'causation': 23,
'move': 26,
'suffers': 15,
'same': 765,
'defect': 17,
'Anza': 20,
'Ideal': 19,
'Steel': 25,
'Supply': 10,
'Corp.': 207,
'547': 76,
'451': 41,
'458': 59,
'461': 50,
'where': 588,
'causing': 13,
'harm': 146,
'distinct': 73,
'giving': 86,
'rise': 60,
'see': 1707,
'disconnect': 2,
'sharper': 2,
'In': 1893,
'party': 520,
'both': 465,
'engaged': 65,
'harmful': 22,
'fraudulent': 38,
'act': 231,
'Here': 107,
'liability': 247,
'rests': 65,
'just': 333,
'separate': 209,
'actions': 258,
'carried': 38,
'parties': 678,
'extend': 71,
'situations': 32,
'defendant': 613,
'third': 159,
'made': 653,
'easier': 33,
'fourth': 41,
'taxpayer': 11,
'taxpayers': 13,
'caused': 112,
'place': 389,
'decided': 169,
'Put': 8,
'simply': 305,
'obligation': 81,
'before': 826,
'stretched': 4,
'chain': 26,
'declines': 15,
'today': 275,
'See': 3037,
'460': 41,
'9': 509,
'b': 120,
'attempts': 59,
'avoid': 218,
'conclusion': 453,
'characterizing': 9,
'merely': 201,
'systematic': 37,
'scheme': 133,
'defraud': 39,
'escape': 26,
'embraced': 14,
'all': 1495,
'indirectly': 20,
'harmed': 15,
'precedent': 165,
'would': 2822,
'become': 106,
'mere': 81,
'pleading': 45,
'rule': 715,
'makes': 276,
'clear': 455,
'compensable': 5,
'flowing': 5,
'...': 1600,
'necessarily': 133,
'acts': 187,
'supra': 994,
'457': 31,
'led': 63,
'injuries': 28,
'also': 1734,
'errs': 17,
'relying': 43,
'Bridge': 14,
'Phoenix': 3,
'Bond': 5,
'&': 499,
'Indemnity': 8,
'Co.': 462,
'553': 61,
'_': 2073,
'There': 268,
'plaintiffs': 336,
'straightforward': 45,
'involved': 144,
'easily': 77,
'identifiable': 4,
'connection': 70,
'issue': 788,
'there': 795,
'petitioners': 370,
'misrepresentations': 16,
'no': 1984,
'independent': 312,
'factors': 215,
'account[ed': 3,
'anything': 112,
'Multiple': 4,
'steps': 56,
'And': 721,
'contrast': 109,
'certainly': 79,
'10': 567,
'14': 413,
'J.': 1306,
'Scalia': 334,
'Thomas': 223,
'Alito': 154,
'JJ': 104,
'joined': 184,
'Ginsburg': 118,
'concurring': 560,
'Breyer': 176,
'dissenting': 483,
'Stevens': 381,
'Kennedy': 228,
'Sotomayor': 120,
'took': 146,
'consideration': 157,
'decision': 1001,
'HEMI': 3,
'GROUP': 3,
'LLC': 25,
'KAI': 3,
'GACHUPIN': 3,
'PETITIONERS': 68,
'CITY': 12,
'OF': 57,
'NEW': 8,
'YORK': 6,
'writ': 286,
'certiorari': 374,
'united': 168,
'states': 313,
'court': 2667,
'appeals': 221,
'second': 420,
'circuit': 174,
'January': 58,
'25': 197,
'2010': 352,
'seldom': 5,
'own': 406,
'Federal': 681,
'however': 606,
'vendors': 6,
'argues': 158,
'constitutes': 72,
'lose': 25,
'tens': 7,
'millions': 15,
'dollars': 18,
'unrecovered': 1,
'we': 1960,
'hold': 205,
'We': 899,
'therefore': 414,
'reverse': 63,
'contrary': 247,
'I': 1534,
'This': 818,
'arises': 26,
'motion': 320,
'dismiss': 90,
'accept': 144,
'true': 206,
'factual': 154,
'allegations': 57,
'amended': 139,
'complaint': 177,
'Leatherman': 1,
'Tarrant': 1,
'County': 123,
'Narcotics': 5,
'Intelligence': 4,
'Coordination': 3,
'Unit': 2,
'507': 43,
'163': 29,
'164': 30,
'1993': 96,
'authorizes': 79,
'impose': 182,
'N.': 260,
'Y.': 84,
'Unconsol': 2,
'Law': 303,
'Ann': 196,
'9436(1': 2,
'West': 131,
'Supp': 333,
'2009': 563,
'Under': 214,
'authority': 532,
'levied': 3,
'$': 227,
'1.50': 1,
'per': 215,
'pack': 3,
'each': 307,
'standard': 410,
'possessed': 24,
'within': 601,
'sale': 67,
'use': 505,
'Admin': 11,
'Code': 355,
'11': 518,
'1302(a': 1,
'2008': 437,
'Record': 81,
'A1016': 1,
'When': 243,
'buy': 10,
'seller': 12,
'responsible': 68,
'charging': 24,
'collecting': 47,
'remitting': 1,
'Tax': 17,
'471(2': 1,
'Out': 1,
'Smokes-Spirits.com': 3,
'Inc.': 546,
'432': 26,
'433': 20,
'CA2': 124,
'Instead': 129,
'recovering': 2,
'sold': 31,
'outside': 167,
'difficult': 148,
'often': 211,
'reluctant': 14,
'tough': 2,
'One': 98,
'way': 281,
'gather': 8,
'assist': 16,
'through': 426,
'63': 65,
'Stat': 412,
'884': 6,
'69': 34,
'627': 10,
'register': 74,
'report': 114,
'tobacco': 12,
'administrators': 15,
'listing': 41,
'name': 80,
'address': 246,
'quantity': 12,
'purchased': 18,
'have': 3351,
'executed': 25,
'agreement': 420,
'undertake': 32,
'cooperate': 12,
'fully': 87,
'keep': 250,
'promptly': 19,
'informed': 95,
'reference': 119,
'any': 2359,
'person': 421,
'transaction': 36,
'including': 397,
'i]nformation': 1,
'obtained': 83,
'may': 1890,
'result': 353,
'additional': 214,
'provided': 261,
'disclosure': 280,
'permissible': 48,
'existing': 105,
'laws': 397,
'agreements': 70,
'A1003': 1,
'asserts': 84,
'forwards': 1,
'A998': 1,
'Amended': 8,
'Compl': 5,
'¶54': 1,
'¶¶58': 2,
'59': 49,
'company': 103,
'does': 1700,
'alleges': 24,
'cost': 43,
'hundreds': 20,
'year': 356,
'excise': 5,
'A996': 3,
'Based': 35,
'federal': 1542,
'B': 206,
'provides': 363,
'private': 339,
'action': 606,
'a]ny': 17,
'business': 445,
'property': 652,
'section': 200,
'1962': 18,
'chapter': 48,
'Section': 267,
'turn': 125,
'contains': 75,
'criminal': 495,
'provisions': 483,
'Specifically': 22,
'1962(c': 2,
'invokes': 22,
'unlawful': 90,
'employed': 52,
'associated': 35,
'enterprise': 19,
'activities': 218,
'affect': 77,
'interstate': 88,
'commerce': 83,
'participate': 35,
'affairs': 31,
'pattern': 22,
'activity': 154,
'R]acketeering': 1,
'include': 214,
'number': 220,
'called': 103,
'two': 757,
'identifying': 41,
'constitute': 120,
'offenses': 123,
'A980': 4,
'Invoking': 5,
'suffered': 53,
'form': 197,
'terms--"by': 1,
'contest': 21,
'characterization': 26,
'violations': 109,
'actionable': 7,
'assume': 108,
'without': 736,
'deciding': 106,
'material': 172,
'serve': 111,
'determining': 172,
'owner': 40,
'officer': 105,
'Kai': 1,
'Gachupin': 3,
'individual': 316,
'duty': 194,
'Nexicon': 1,
'No': 393,
'03': 5,
'CV': 1,
'383': 38,
'DAB': 1,
'2006': 287,
'WL': 20,
'647716': 1,
'*': 387,
'7-*8': 1,
'SDNY': 24,
'Mar.': 48,
'formed': 60,
'7-*10': 1,
'ground': 196,
'whether': 1708,
'loss': 58,
'1964': 45,
'further': 411,
'proceedings': 325,
'established': 290,
'operated': 29,
'447': 50,
'448': 15,
'444': 47,
'445': 42,
'concluded': 310,
'440': 25,
'viable': 5,
'Judge': 102,
'Winter': 15,
'dissented': 11,
'petition': 473,
'asking': 53,
'Pet': 233,
'Cert': 205,
'i.': 13,
'granted': 316,
'556': 72,
'II': 307,
'Though': 26,
'framed': 16,
'single': 202,
'question': 966,
'raises': 43,
'issues': 125,
'First': 800,
'allegedly': 49,
'decide': 301,
'1992': 105,
'set': 342,
'forth': 168,
'addressed': 110,
'brought': 178,
'SIPC': 8,
'against': 799,
'defendants': 177,
'whom': 156,
'manipulated': 4,
'stock': 38,
'prices': 33,
'262': 20,
'263': 42,
'reimburse': 2,
'certain': 445,
'registered': 87,
'broker': 8,
'dealers': 47,
'event': 105,
'unable': 44,
'meet': 64,
'financial': 80,
'obligations': 52,
'261': 19,
'conspiracy': 59,
'manipulators': 1,
'detected': 2,
'collapsed': 5,
'insurer': 1,
'ultimately': 76,
'hook': 2,
'nearly': 73,
'13': 452,
'million': 63,
'cover': 87,
'conspirators': 6,
'phrase': 187,
'used': 324,
'explained': 261,
'Applying': 43,
'hand': 92,
'quoting': 424,
'Associated': 7,
'Gen.': 42,
'Contractors': 4,
'Cal': 71,
'Carpenters': 1,
'459': 44,
'519': 42,
'534': 55,
'1983': 137,
'Southern': 38,
'Pacific': 90,
'Darnell': 1,
'Taenzer': 1,
'Lumber': 2,
'245': 15,
'531': 44,
'533': 70,
'1918': 5,
'internal': 412,
'quotation': 395,
'marks': 398,
'omitted': 484,
'Our': 158,
'cases': 934,
'confirm': 38,
'slip': 349,
'op': 346,
'19': 240,
'us': 462,
'confirms': 28,
'indirect': 9,
'considered': 217,
'competitor': 10,
'National': 215,
'defrauded': 9,
'able': 116,
'undercut': 11,
'lower': 128,
'offered': 58,
'contended': 20,
'allowed': 105,
'attract': 11,
'expense': 17,
'Finding': 24,
'victim': 66,
'being': 186,
'recognized': 245,
'harms': 35,
'when': 1284,
'applicable': 184,
'offering': 10,
'entirely': 116,
'defrauding': 3,
'constituting': 33,
'Thus': 258,
'viewed': 48,
'point': 360,
'important': 232,
'nevertheless': 53,
'found': 348,
'distinction': 113,
'relevant': 386,
'sufficient': 207,
'defeat': 28,
'decline': 53,
'cf': 120,
'n.': 894,
'46': 97,
'finding': 240,
'antitrust': 30,
'context': 325,
'stems': 7,
'most': 431,
'persons': 247,
'victims': 59,
'highlighted': 9,
'better': 85,
'situated': 14,
'incentive': 33,
'sue': 49,
'269': 16,
'270': 21,
'seek': 211,
'recovery': 43,
'imposes': 87,
'2.75': 1,
'double': 34,
'what': 783,
'charges': 80,
'471(1': 1,
'opine': 2,
'bring': 71,
'Suffice': 3,
'say': 221,
'concrete': 29,
'incentives': 18,
'try': 31,
'accuses': 2,
'Anzas': 1,
'substantial': 213,
'money': 134,
'If': 429,
'expected': 39,
'appropriate': 227,
'remedies': 79,
'dissent': 470,
'foreseeability': 6,
'rather': 317,
'existence': 97,
'sufficiently': 94,
'find': 233,
'satisfied': 77,
'foreseeable': 23,
'consequence': 79,
'intended': 231,
'indeed': 87,
'desired': 13,
'falls': 55,
'risks': 36,
'Congress': 1679,
'sought': 242,
'prevent': 189,
'Post': 202,
'6': 594,
'line': 163,
'reasoning': 102,
'sounds': 10,
'familiar': 33,
'precisely': 50,
'argument': 541,
'lodged': 7,
'majority': 415,
'criticized': 16,
'view': 556,
'permit[ting': 1,
'evade': 9,
'consequences': 158,
'behavior': 76,
'470': 56,
'carry': 60,
'day': 202,
'asked': 146,
'revisit': 12,
'concepts': 6,
'course': 264,
'many': 355,
'shapes': 5,
'precedents': 218,
'make': 467,
'focus': 62,
'directness': 4,
'mention': 42,
'concept': 62,
'offers': 59,
'responses': 32,
'challenges': 140,
'our': 1019,
'Brief': 919,
'42': 177,
'Having': 25,
'broadly': 48,
'contends': 153,
'Otherwise': 13,
'example': 331,
'give': 215,
'competitive': 22,
'advantage': 34,
'over': 456,
'454': 52,
'455': 49,
'allegation': 16,
'circumvent': 10,
'claiming': 39,
'aim': 42,
'increase': 54,
'market': 167,
'share': 54,
'460.1': 1,
'moreover': 80,
'Sedima': 2,
'P.': 107,
'R.': 263,
'L.': 350,
'Imrex': 2,
'473': 60,
'479': 68,
'497': 28,
'1985': 74,
'statement': 264,
'went': 39,
'allege': 24,
'assertion': 65,
'legal': 414,
'very': 226,
'relies': 92,
'reaffirmed': 24,
'wrongful': 42,
'competing': 32,
'bidders': 7,
'county': 48,
'lien': 4,
'auction': 18,
'liens': 5,
'profitable': 3,
'lowest': 8,
'possible': 180,
'bid': 7,
'multiple': 48,
'low': 22,
'bidding': 4,
'percentage': 28,
'penalty': 200,
'bidder': 6,
'require': 330,
'0': 4,
'%': 135,
'awarded': 56,
'devised': 11,
'plan': 163,
'allocate': 3,
'rotational': 2,
'basis': 381,
'3': 747,
'noted': 256,
'created': 157,
'perverse': 9,
'Bidders': 1,
'addition': 121,
'themselves': 141,
'sen[t': 1,
'agents': 19,
'behalf': 110,
'obtain': 106,
'disproportionate': 29,
'prohibited': 108,
...}
def show_doc_counts(input_corpus, weighting, limit=20):
    doc_counts = input_corpus.word_doc_counts(weighting=weighting, filter_stops=True, by="orth_")
    print("\n".join(f"{a:15} {b}" for a, b in sorted(doc_counts.items(), key=lambda x: x[1], reverse=True)[:limit]))
The word_doc_counts method provides a few ways of quantifying the prevalence of individual words across the corpus: whether a word appears many times in most documents, just a few times in a few documents, many times in a few documents, or just a few times in most documents.
print("# DOCS APPEARING IN / TOTAL # DOCS", "\n", "-----------", sep="")
show_doc_counts(corpus, "freq")
print("\n", "LOG(TOTAL # DOCS / # DOCS APPEARING IN)", "\n", "-----------", sep="")
show_doc_counts(corpus, "idf")
# DOCS APPEARING IN / TOTAL # DOCS
-----------
Court 0.9873417721518988
case 0.9873417721518988
certiorari 0.9873417721518988
granted 0.9873417721518988
U. 0.9746835443037974
S. 0.9746835443037974
judgment 0.9746835443037974
v. 0.9746835443037974
decision 0.9746835443037974
court 0.9746835443037974
C. 0.9620253164556962
held 0.9620253164556962
opinion 0.9620253164556962
2009 0.9620253164556962
1 0.9620253164556962
F. 0.9493670886075949
Justice 0.9493670886075949
Id. 0.9493670886075949
e.g. 0.9493670886075949
issue 0.9493670886075949
LOG(TOTAL # DOCS / # DOCS APPEARING IN)
-----------
cigarettes 4.382026634673881
Hemi 4.382026634673881
cigarette 4.382026634673881
RICO 4.382026634673881
racketeering 4.382026634673881
activit[ies 4.382026634673881
1964(c 4.382026634673881
Proximate 4.382026634673881
indirec[t 4.382026634673881
Anza 4.382026634673881
Ideal 4.382026634673881
disconnect 4.382026634673881
sharper 4.382026634673881
Phoenix 4.382026634673881
account[ed 4.382026634673881
HEMI 4.382026634673881
GROUP 4.382026634673881
KAI 4.382026634673881
GACHUPIN 4.382026634673881
YORK 4.382026634673881
textacy provides implementations of algorithms for identifying words and phrases that are representative of a document (aka keyterm extraction).
from textacy.extract import keyterms as ke
# corpus[0].text
# Run the YAKE algorithm (Campos et al., 2018) on a given document
key_terms_yake = ke.yake(corpus[0])
key_terms_yake
[('New York City', 0.002288298045327596),
('New York State', 0.0060401030525529436),
('U. S. C.', 0.0075325125188752005),
('Jenkins Act', 0.012460549374297763),
('York City customer', 0.021988206972901634),
('RICO', 0.027329026591127712),
('Hemi Group', 0.03384924285412936),
('York City cigarette', 0.03513697406836725),
('Jenkins Act information', 0.03892293549952245),
('RICO claim', 0.042756731344701926)]
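YAKE is only one of the keyterm algorithms textacy implements; the keyterms module also includes TextRank, sCAKE, and SGRank. As a quick sketch for comparison (reusing the ke import above), here is TextRank on the same document:
# Run TextRank (Mihalcea & Tarau, 2004) on the same document for comparison;
# topn limits the output to the ten highest-scoring terms
key_terms_textrank = ke.textrank(corpus[0], topn=10)
key_terms_textrank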
Keyword in context#
Sometimes researchers find it helpful just to see a particular keyword in context.
for doc in corpus[:5]:
    print("\n", doc._.meta.get('case_name'), "\n", "-" * len(doc._.meta.get('case_name')), "\n")
    for match in textacy.extract.kwic.keyword_in_context(doc.text, "judgment"):
        print(" ".join(match).replace("\n", " "))
HEMI GROUP, LLC AND KAI GACHUPIN v. CITY OF NEW YORK, NEW YORK
--------------------------------------------------------------
ed the claims, but the Second Circuit vacated the judgment and remanded. Among other things, the Court of Ap
the City had stated a valid RICO claim. Held: The judgment is reversed, and the case is remanded. 541 F. 3d
opinion concurring in part and concurring in the judgment . Breyer, J., filed a dissenting opinion, in which
The Second Circuit vacated the District Court's judgment and remanded for further proceedings. The Court o
it. The City, therefore, has no RICO claim. The judgment of the Court of Appeals for the Second Circuit is
insburg, concurring in part and concurring in the judgment . As the Court points out, this is a case "about
he above-stated view, and I concur in the Court's judgment . HEMI GROUP, LLC and KAI GACHUPIN, PETITIONERS v.
CITIZENS UNITED v. FEDERAL ELECTION COMMISSION
----------------------------------------------
ppellee Federal Election Commission (FEC) summary judgment . Held: 1. Because the question whether §441b app
(Scalia, J., concurring in part and concurring in judgment ). We agree with that conclusion and hold that sta
t later convened to hear the cause. The resulting judgment gives rise to this appeal. Citizens United has a
m), and then granted the FEC's motion for summary judgment , App. 261a-262a. See id., at 261a ("Based on the
or opinion, we find that the [FEC] is entitled to judgment as a matter of law. See Citizen[s] United v. FEC,
onnell, supra, at 339 (Kennedy, J., concurring in judgment in part and dissenting in part). The Snowe-Jeffor
rt's later opinion, which granted the FEC summary judgment , was "[b]ased on the reasoning of [its] prior opi
62 (Scalia, J., concurring in part, concurring in judgment in part, and dissenting in part); id., at 273-275
part, concurring in result in part, concurring in judgment in part, and dissenting in part); id., at 322-338
pore over each word of a text to see if, in their judgment , it accords with the 11-factor test they have pro
er, 502 U. S., at 124 (Kennedy, J., concurring in judgment ), the quoted language from WRTL provides a suffic
toral opportunities means making and implementing judgment s about which strengths should be permitted to con
issenting); id., at 773 (White, J., concurring in judgment ). With the advent of the Internet and the decline
S. 334, 360-361 (1995) (Thomas, J., concurring in judgment ). Yet television networks and major newspapers ow
t 341-343; id., at 367 (Thomas, J., concurring in judgment ). At the founding, speech was open, comprehensive
endent expenditures; if they surrender their best judgment ; and if they put expediency before principle, the
ation's course; still others simply might suspend judgment on these points but decide to think more about is
nell, supra, at 341 (opinion of Kennedy, J.). The judgment of the District Court is reversed with respect to
ctions on corporate independent expenditures. The judgment is affirmed with respect to BCRA's disclaimer and
Thomas, JJ., concurring in part and concurring in judgment ); McConnell, 540 U. S., at 247, 264, 286 (opinion
86 (Thomas, J., concurring in part, concurring in judgment in part, and dissenting in part). These readings
3) (Scalia, J., concurring in part, concurring in judgment in part, and dissenting in part) (quoting C. Cook
U. S. 334, 360 (1995) (Thomas, J., concurring in judgment ); see also McConnell, 540 U. S., at 252-253 (opin
an affirmative answer to that question is, in my judgment , profoundly misguided. Even more misguided is the
Comm. (NRWC), and have accepted the "legislative judgment that the special characteristics of the corporate
of §203. App. 23a-24a. In its motion for summary judgment , however, Citizens United expressly abandoned its
Roberts, J., concurring in part and concurring in judgment ). Consider just three of the narrower grounds of
s of longstanding practice and Congress' reasoned judgment that certain regulations which leave "untouched f
precedents "represent respect for the legislative judgment that the special characteristics of the corporate
oach taken by the majority cannot be right, in my judgment . It disregards our constitutional history and the
erting an " 'undue influence on an officeholder's judgment ' " and from creating " 'the appearance of such in
orations). When the McConnell Court affirmed the judgment of the District Court regarding §203, we did not
ll, 540 U. S., at 306 (Kennedy, J., concurring in judgment in part and dissenting in part); see also id., at
63 (Scalia, J., concurring in part, concurring in judgment in part, and dissenting in part), a disreputable
nted where, as here, we deal with a congressional judgment that has remained essentially unchanged throughou
tisfy heightened judicial scrutiny of legislative judgment s will vary up or down with the novelty and plausi
years of bipartisan deliberation and its reasoned judgment on this basis, without first confirming that the
J., dissenting). "In the meantime, a legislative judgment that 'enough is enough' should command the greate
Congress' factual findings and its constitutional judgment : It acknowledges the validity of the interest in
O, 335 U. S., at 144 (Rutledge, J., concurring in judgment )), and this, in turn, "interferes with the 'open
he expansive protections afforded by the business judgment rule. Blair & Stout 320; see also id., at 298-315
relevance of established facts and the considered judgment s of state and federal legislatures over many deca
corporate money in politics. I would affirm the judgment of the District Court. CITIZENS UNITED, APPELLANT
3) (Thomas, J., concurring in part, concurring in judgment in part, and dissenting in part) (internal quotat
64 (Thomas, J., concurring in part, concurring in judgment in part, and dissenting in part) (quoting Nixon v
ordingly, I respectfully dissent from the Court's judgment upholding BCRA §§201 and 311. FOOTNOTESFootnote 1
al Election Commission's (FEC) motion for summary judgment , App. 261a-262a, any question about statutory val
done "on the basis of entirely subjective, ad hoc judgment s," 523 U. S., at 690, that suggested anticompetit
539 U. S., at 163-164 (Kennedy, J., concurring in judgment ). Both Courts also heard criticisms of Austin fro
concurring in part and dissenting in part). In my judgment , such limitations may be justified to the extent
"We should defer to [the legislature's] political judgment that unlimited spending threatens the integrity o
HOLLY WOOD, PETITIONER v. RICHARD F. ALLEN, COMMISSIONER, ALABAMA DEPARTMENT OF CORRECTIONS, et al.
---------------------------------------------------------------------------------------------------
de a strategic decision, but to whether counsel's judgment was reasonable, a question not before this Court.
onship to §2254(e)(1). Accordingly, we affirm the judgment of the Court of Appeals on that basis. I In 1993
umption that counsel exercised sound professional judgment , supported by ample reasons, not to present the i
rategic decision, but rather to whether counsel's judgment was reasonable — a question we do not reach. See
ons were an unreasonable exercise of professional judgment and constituted deficient performance under Stric
itself was a reasonable exercise of professional judgment under Strickland or whether the application of St
mination of the facts. Accordingly, we affirm the judgment of the Court of Appeals for the Eleventh Circuit.
ess. That was a strategic decision based on their judgment that the evidence would do more harm than good. B
resulted from inattention, not reasoned strategic judgment "); Strickland, 466 U. S., at 690-691. Moreover, "
itself was a reasonable exercise of professional judgment under Strickland or whether the application of St
AGRON KUCANA v. ERIC H. HOLDER, JR., ATTORNEY GENERAL
-----------------------------------------------------
. (1) The amicus defending the Seventh Circuit's judgment urges that regulations suffice to trigger §1252(a
laces within the no-judicial-review category "any judgment regarding the granting of relief under section 11
ed. Alito, J., filed an opinion concurring in the judgment . AGRON KUCANA, PETITIONER v. ERIC H. HOLDER,Jr.,
ubparagraph (D),[1] and regardless of whether the judgment , decision, or action is made in removal proceedin
urt shall have jurisdiction to review-- "(i) any judgment regarding the granting of relief under section 11
micus curiae, in support of the Seventh Circuit's judgment . 557 U. S. ___ (2009). Ms. Leiter has ably discha
mmigration decisions to motions for relief from a judgment under Federal Rule of Civil Procedure 60(b)). Fed
Nevertheless, in defense of the Seventh Circuit's judgment , amicus urges that regulations suffice to trigger
f for Court-Appointed Amicus Curiae in Support of Judgment Below 15, 17 (citing, inter alia, Florida Dept. o
laces within the no-judicial-review category "any judgment regarding the granting of relief under section 11
To the clause (i) enumeration of administrative judgment s that are insulated from judicial review, Congres
dicial review. * * * For the reasons stated, the judgment of the United States Court of Appeals for the Sev
nuary 20, 2010] Justice Alito, concurring in the judgment . I agree that the Court of Appeals had jurisdict
f for Court-Appointed Amicus Curiae in Support of Judgment Below 41-42. Amicus' argument is ingenious but u
f for Court-Appointed Amicus Curiae in Support of Judgment Below 19, n. 8 (quoting §1229a(c)(7)(B)). One can
f for Court-Appointed Amicus Curiae in Support of Judgment Below. In every one of those examples, Congress e
f for Court-Appointed Amicus Curiae in Support of Judgment Below 21-23. But §1252(a)(2)(B)(ii) does not say
Congress want to exclude review for discretionary judgment s by the Attorney General that are recited explici
y in the statute, but provide judicial review for judgment s that are just as lawfully discretionary because
f for Court-Appointed Amicus Curiae in Support of Judgment Below 32-34. The report states that §1252(a)(2)(B
MARCUS A. WELLONS v. HILTON HALL, WARDEN
----------------------------------------
rd for an order granting certiorari, vacating the judgment below, and remanding the case (GVR) remains as it
ave to proceed in forma pauperis are granted. The judgment is vacated, and the case is remanded to the Eleve
nts Wellons' petition for certiorari, vacates the judgment of the Eleventh Circuit, and remands ("GVRs") in
erse or set the case for argument; otherwise, the judgment below must stand. The same is true if (as the Cou
rits question. If they erred in that regard their judgment should be reversed rather than remanded "in light
we are, to vacate and send back their authorized judgment s for inconsequential imperfection of opinion — as
authority or development that casts doubt on the judgment of the court below. What the Court has done — usi
t of 1996 (AEDPA) to the "Georgia Supreme Court's judgment as to the substance and effect of the ex parte co
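The keyword_in_context function also takes optional arguments to tune the display; a small sketch, assuming the window_width (characters of context per side) and ignore_case parameters of recent textacy versions:
# Show wider context around each match in the first decision,
# and match the keyword case-sensitively
for match in textacy.extract.kwic.keyword_in_context(corpus[0].text, "judgment", window_width=75, ignore_case=False):
    print(" ".join(match).replace("\n", " "))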
Vectorization#
Let’s continue with corpus-level analysis by taking advantage of textacy’s Vectorizer class, which wraps functionality from scikit-learn to count the prevalence of tokens in each document of the corpus and, if desired, to apply weights to those counts. We could work directly in scikit-learn, but it reduces mental overhead to learn one library and be able to do a great deal with it.
We’ll create a vectorizer, sticking with the normal term frequency defaults but discarding words that appear in fewer than 3 documents or more than 95% of documents. We’ll also limit our features to the top 500 words according to document frequency. This means our feature set, or columns, will have a higher degree of representation across the corpus. We could further scale these counts according to document frequency (or inverse document frequency) weights, or normalize the weights so that they add up to 1 for each document row (L1 norm), and so on.
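For reference, a weighted variant might be configured as in the following sketch; the tf_type, idf_type, and norm parameter values are assumptions about textacy’s Vectorizer API rather than workshop code, and below we proceed with plain counts:
import textacy.representations

# A tf-idf weighted, L2-normalized alternative (illustrative settings)
tfidf_vectorizer = textacy.representations.Vectorizer(
    tf_type="linear",    # raw term frequency
    idf_type="smooth",   # smoothed inverse document frequency
    norm="l2",           # scale each document row to unit length
    min_df=3, max_df=0.95, max_n_terms=500,
)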
import textacy.representations

vectorizer = textacy.representations.Vectorizer(min_df=3, max_df=.95, max_n_terms=500)
tokenized_corpus = [
    [token.orth_ for token in textacy.extract.words(doc, filter_nums=True, filter_stops=True, filter_punct=True)]
    for doc in corpus
]
dtm = vectorizer.fit_transform(tokenized_corpus)
dtm
<79x500 sparse matrix of type '<class 'numpy.int32'>'
with 22870 stored elements in Compressed Sparse Row format>
We now have a matrix representation of our corpus, where rows are documents and columns (or features) are words from the corpus. The value at any given point is the number of times that word appears in that document. Once we have a document-term matrix, we can do several things with it within textacy itself, and we can also pass it into algorithms from scikit-learn or other libraries.
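Because the document-term matrix is a standard SciPy sparse matrix, it can be passed straight to scikit-learn estimators. As one illustrative sketch (not part of the workshop), here is how the 500 term columns could be reduced to 10 latent dimensions with truncated SVD:
from sklearn.decomposition import TruncatedSVD

# Project the sparse document-term matrix onto 10 latent dimensions (LSA);
# doc_vectors has shape (n_docs, 10)
svd = TruncatedSVD(n_components=10, random_state=0)
doc_vectors = svd.fit_transform(dtm)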
# Let's look at some of the terms
vectorizer.terms_list[:20]
['$',
'2d',
'3d',
'A.',
'AEDPA',
'Act',
'Amendment',
'American',
'Ann',
'Ante',
'App',
'Appeals',
'B',
'Board',
'Breyer',
'Brief',
'Cert',
'Cf',
'Circuit',
'Citizens']
We can see that we are still getting a number of terms that might be filtered out, such as symbols and abbreviations. The most straightforward solutions are to filter the terms against a dictionary during vectorization, which carries the risk of inadvertently discarding words you’d prefer to keep in the dataset, or to curate a custom stopword list, which can be inflexible and time-consuming. Otherwise, the corpus analysis tools used with the vectorized texts (e.g., topic modeling or stylistic analysis; see below) often have ways of recognizing and sequestering unwanted terms so that they can be excluded from the results if desired.
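As a minimal sketch, a custom stopword pass over the tokenized corpus might look like the following; CUSTOM_STOPS is an illustrative set, not a vetted list:
# Drop citation abbreviations and stray symbols before re-vectorizing;
# the entries in CUSTOM_STOPS are illustrative, not exhaustive
CUSTOM_STOPS = {"$", "2d", "3d", "Ante", "Cf", "Ibid", "Id."}
filtered_corpus = [
    [tok for tok in doc_tokens if tok not in CUSTOM_STOPS]
    for doc_tokens in tokenized_corpus
]
filtered_dtm = vectorizer.fit_transform(filtered_corpus)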
Exercise - topic modeling#
Read through the code below for one example of what we can do with a vectorized corpus. Topic modeling is very popular for semantic exploration of texts, and there are numerous implementations of it; textacy uses the implementations from scikit-learn. Our corpus is rather small for topic modeling, but just to see how it’s done, we’ll go ahead. First, though: topic modeling works best when the texts are divided into approximately equal-sized “chunks.” A quick word count of the corpus shows that the decisions vary considerably in length, which would skew the topic model.
for doc in corpus:
    print(f"{len(doc): >5} {doc._.meta['case_name'][:80]}")
13007 HEMI GROUP, LLC AND KAI GACHUPIN v. CITY OF NEW YORK, NEW YORK
72325 CITIZENS UNITED v. FEDERAL ELECTION COMMISSION
8333 HOLLY WOOD, PETITIONER v. RICHARD F. ALLEN, COMMISSIONER, ALABAMA DEPARTMENT OF
9947 AGRON KUCANA v. ERIC H. HOLDER, JR., ATTORNEY GENERAL
5508 MARCUS A. WELLONS v. HILTON HALL, WARDEN
4083 ERIC PRESLEY v. GEORGIA
7609 NRG POWER MARKETING, LLC, et al. v. MAINE PUBLIC UTILITIES COMMISSION et al.
6983 E. K. MCDANIEL, WARDEN, et al. v. TROY BROWN
13844 MARYLAND v. MICHAEL BLAINE SHATZER, SR.
8945 THE HERTZ CORPORATION v. MELINDA FRIEND et al.
11605 FLORIDA v. KEVIN DEWAYNE POWELL
2343 RICK THALER, DIRECTOR, TEXAS DEPARTMENT OF CRIMINAL JUSTICE, CORRECTIONAL INSTIT
8531 MAC'S SHELL SERVICE, INC., et al. v. SHELL OIL PRODUCTS CO. LLC et al.
8259 REED ELSEVIER, INC., et al., v. IRVIN MUCHNICK et al.
9502 CURTIS DARNELL JOHNSON v. UNITED STATES
281 JAMAL KIYEMBA et al. v. BARACK H. OBAMA, PRESIDENT OF THE UNITED STATES et al.
12424 MILAVETZ, GALLOP & MILAVETZ, P. A., et al. v. UNITED STATES
13983 TAYLOR JAMES BLOATE v. UNITED STATES
31066 SHADY GROVE ORTHOPEDIC ASSOCIATES, P. A. v. ALLSTATE INSURANCE COMPANY
15135 JOSE PADILLA v. KENTUCKY
8236 JERRY N. JONES, et al. v. HARRIS ASSOCIATES L. P.
8634 MARY BERGHUIS, WARDEN v. DIAPOLIS SMITH
15018 GRAHAM COUNTY SOIL AND WATER CONSERVATION DISTRICT, et al. v. UNITED STATES ex r
8025 UNITED STUDENT AID FUNDS, INC. v. FRANCISCO J. ESPINOSA
5499 ESTHER HUI, et al. v. YANIRA CASTANEDA, AS PERSONAL REPRESENTATIVE OF THE ESTATE
14398 PAUL RENICO, WARDEN v. REGINALD LETT
27245 KEN L. SALAZAR, SECRETARY OF THE INTERIOR, et al. v. FRANK BUONO
16003 STOLT-NIELSEN S. A., et al. v. ANIMALFEEDS INTERNATIONAL CORP.
12015 MERCK & CO., INC., et al. v. RICHARD REYNOLDS et al.
25675 KAREN L. JERMAN v. CARLISLE, MCNELLIE, RINI, KRAMER & ULRICH LPA, et al.
12618 SONNY PERDUE, GOVERNOR OF GEORGIA, et al. v. KENNY A., BY HIS NEXT FRIEND LINDA
19636 UNITED STATES v. ROBERT J. STEVENS
19849 TIMOTHY MARK CAMERON ABBOTT v. JACQUELYN VAYE ABBOTT
31417 TERRANCE JAMAR GRAHAM v. FLORIDA
23340 UNITED STATES v. GRAYDON EARL COMSTOCK, JR., et al.
20 JOE HARRIS SULLIVAN v. FLORIDA
8980 AMERICAN NEEDLE, INC. v. NATIONAL FOOTBALL LEAGUE et al.
5179 ARTHUR L. LEWIS, JR., et al. v. CITY OF CHICAGO, ILLINOIS
9721 UNITED STATES v. MARTIN O'BRIEN AND ARTHUR BURGESS
6672 BRIDGET HARDT v. RELIANCE STANDARD LIFE INSURANCE COMPANY
5771 UNITED STATES v. GLENN MARCUS
4438 JOHN ROBERTSON v. UNITED STATES ex rel. WYKENNA WATSON
8687 LAWRENCE JOSEPH JEFFERSON v. STEPHEN UPTON, WARDEN
10645 MOHAMED ALI SAMANTAR v. BASHE ABDI YOUSUF et al.
17847 MARY BERGHUIS, WARDEN v. VAN CHESTER THOMPKINS
10183 RICHARD A. LEVIN, TAX COMMISSIONER OF OHIO v. COMMERCE ENERGY, INC., et al.
15010 THOMAS CARR v. UNITED STATES
12933 MICHAEL GARY BARBER, et al. v. J. E. THOMAS, WARDEN
13676 JAN HAMILTON, CHAPTER 13 TRUSTEE v. STEPHANIE KAY LANNING
8312 WANDA KRUPSKI v. COSTA CROCIERE S. P. A.
9987 JOSE ANGEL CARACHURI-ROSENDO v. ERIC H. HOLDER, JR., ATTORNEY GENERAL
8029 MICHAEL J. ASTRUE, COMMISSIONER OF SOCIAL SECURITY v. CATHERINE G. RATLIFF
10821 BRIAN RUSSELL DOLAN v. UNITED STATES
17847 ALBERT HOLLAND v. FLORIDA
11172 NEW PROCESS STEEL, L. P. v. NATIONAL LABOR RELATIONS BOARD
18022 STOP THE BEACH RENOURISHMENT, INC. v. FLORIDA DEPARTMENT OF ENVIRONMENTAL PROTEC
9320 CITY OF ONTARIO, CALIFORNIA, et al. v. JEFF QUON et al.
18128 WILLIAM G. SCHWAB v. NADEJDA REILLY
13554 PERCY DILLON v. UNITED STATES
25016 ERIC H. HOLDER, JR., ATTORNEY GENERAL, et al. v. HUMANITARIAN LAW PROJECT et al.
10471 RENT-A-CENTER, WEST, INC. v. ANTONIO JACKSON
20738 KAWASAKI KISEN KAISHA LTD. et al. v. REGAL-BELOIT CORP. et al.
18474 MONSANTO COMPANY, et al. v. GEERTSON SEED FARMS et al.
23994 JOHN DOE #1, et al. v. SAM REED, WASHINGTON SECRETARY OF STATE, et al.
16696 ROBERT MORRISON, et al. v. NATIONAL AUSTRALIA BANK LTD. et al.
14013 GRANITE ROCK COMPANY v. INTERNATIONAL BROTHERHOOD OF TEAMSTERS et al.
15751 BILLY JOE MAGWOOD v. TONY PATTERSON, WARDEN, et al.
48969 JEFFREY K. SKILLING v. UNITED STATES
4629 CONRAD M. BLACK, JOHN A. BOULTBEE, AND MARK S. KIPNIS v. UNITED STATES
39727 FREE ENTERPRISE FUND AND BECKSTEAD AND WATTS, LLP v. PUBLIC COMPANY ACCOUNTING O
29263 BERNARD L. BILSKI AND RAND A. WARSAW v. DAVID J. KAPPOS, UNDER SECRETARY OF COMM
33725 CHRISTIAN LEGAL SOCIETY CHAPTER OF THE UNIVERSITY OF CALIFORNIA, HASTINGS COLLEG
87587 OTIS MCDONALD, et al. v. CITY OF CHICAGO, ILLINOIS, et al.
8126 DEMARCUS ALI SEARS v. STEPHEN UPTON, WARDEN
2242 BILL K. WILSON, SUPERINTENDENT, INDIANA STATE PRISON, PETITIONER v. JOSEPH E. CO
8986 KEVIN ABBOTT, PETITIONER v. UNITED STATES
4244 LOS ANGELES COUNTY, CALIFORNIA, PETITIONER v. CRAIG ARTHUR HUMPHRIES et al.
29 COSTCO WHOLESALE CORPORATION, PETITIONER v. OMEGA, S.A.
10250 KEITH SMITH, WARDEN v. FRANK G. SPISAK, JR.
We’ll re-chunk the texts into documents of not more than 500 words and then recompute the document-term matrix.
import itertools

chunked_corpus_unflattened = [
    [text[x:x+500] for x in range(0, len(text), 500)] for text in tokenized_corpus
]
chunked_corpus = list(itertools.chain.from_iterable(chunked_corpus_unflattened))
chunked_dtm = vectorizer.fit_transform(chunked_corpus)
chunked_dtm
<1006x500 sparse matrix of type '<class 'numpy.int32'>'
with 91636 stored elements in Compressed Sparse Row format>
import textacy.tm

model = textacy.tm.TopicModel("lda", n_topics=15)
model.fit(chunked_dtm)
doc_topic_matrix = model.transform(chunked_dtm)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print(f"{topic_idx: >2} {model.topic_weights(doc_topic_matrix)[topic_idx]: >3.0%}", "|", ", ".join(top_terms))
0 8% | right, rights, Clause, States, Justice, state, Amendment, bear, Constitution, law
1 4% | fees, carrier, party, attorney, fee, award, services, filed, $, Rule
2 4% | rights, child, right, Convention, State, custody, Ann, States, Stat, A.
3 4% | debtor, income, value, felony, delay, time, claimed, Code, property, exempt
4 8% | sentence, sentencing, time, life, habeas, year, State, federal, years, application
5 5% | counsel, attorney, Miranda, state, suspect, interrogation, police, right, evidence, advice
6 1% | business, Director, Office, General, place, corporation, patent, Fed, method, State
7 4% | Footnote, arbitration, agreement, parties, contract, dispute, clause, Inc., question, Brief
8 6% | process, petition, disclosure, plaintiffs, challenge, test, applied, referendum, claim, State
9 11% | Amendment, speech, Hastings, J., public, political, interest, policy, Government, corporations
10 9% | jury, trial, Id., jurors, d., evidence, District, App, judge, reasonable
11 11% | Congress, Board, States, United, statute, power, Act, Commission, authority, foreign
12 8% | F., United, States, 3d, statute, Congress, law, error, criminal, conduct
13 12% | state, federal, law, courts, Act, action, class, Rule, claims, City
14 3% | cross, debt, District, injunction, relief, land, transfer, Government, bankruptcy, agency
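The TopicModel class can also report which chunks load most heavily on each topic via its top_topic_docs method; a short sketch (the indices refer to positions in chunked_corpus, not to whole decisions):
# For each of the first two topics, list the three most representative chunks
for topic_idx, top_doc_idxs in model.top_topic_docs(doc_topic_matrix, topics=[0, 1], top_n=3):
    print(topic_idx, top_doc_idxs)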
Document similarity with word2vec and clustering#
spaCy and textacy provide several built-in methods for measuring the degree of similarity between two documents, including a word2vec-based approach that computes the semantic similarity between documents based on the word vector model included with the spaCy language model. This technique is capable of inferring, for example, that two documents are topically related even if they don’t share any words but use synonyms for a shared concept.
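For a single pair of documents, spaCy exposes this directly as Doc.similarity, which returns the cosine similarity of the documents’ averaged word vectors. For example:
# Cosine similarity between the first two decisions
# (values closer to 1.0 indicate more semantically similar documents)
corpus[0].similarity(corpus[1])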
To evaluate this similarity comparison, we’ll compute the similarity of each pair of docs in the corpus, and then branch out into scikit-learn a bit to look for clusters based on these similarity measurements.
import numpy as np

dim = corpus.n_docs
distance_matrix = np.zeros((dim, dim))
for i, doc_i in enumerate(corpus):
    for j, doc_j in enumerate(corpus):
        if i == j:
            continue  # distance from a document to itself stays 0.0
        if i > j:
            distance_matrix[i, j] = distance_matrix[j, i]  # the matrix is symmetric
        else:
            distance_matrix[i, j] = 1 - doc_i.similarity(doc_j)
distance_matrix
distance_matrix
array([[0. , 0.00428384, 0.00913359, ..., 0.0036863 , 0.05862846,
0.00781636],
[0.00428384, 0. , 0.00910768, ..., 0.00630599, 0.05399509,
0.0058778 ],
[0.00913359, 0.00910768, 0. , ..., 0.00614726, 0.04253501,
0.00566709],
...,
[0.0036863 , 0.00630599, 0.00614726, ..., 0. , 0.05483739,
0.00724887],
[0.05862846, 0.05399509, 0.04253501, ..., 0.05483739, 0. ,
0.04504147],
[0.00781636, 0.0058778 , 0.00566709, ..., 0.00724887, 0.04504147,
0. ]])
With its default settings, the OPTICS density-based clustering algorithm finds only a couple of small clusters, but an examination of the legal issue types coded to each decision indicates that the word2vec-based similarities have indeed grouped semantically related documents: note, for instance, the all-Criminal Procedure cluster below.
from sklearn.cluster import OPTICS
clustering = OPTICS(metric='precomputed').fit(distance_matrix)
print(clustering.labels_)
[-1 0 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 0 -1 -1 -1
-1 1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1 -1 -1 -1 1 -1 -1 0 1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 0 -1 -1 1 -1 -1 -1 0
0 1 -1 -1 -1 -1 1]
from itertools import groupby

clusters = groupby(sorted(enumerate(clustering.labels_), key=lambda x: x[1]), lambda x: x[1])
for cluster_label, docs in clusters:
    if cluster_label == -1:
        continue  # -1 marks documents OPTICS left unclustered
    print(f"Cluster {cluster_label}", "\n---------")
    print("\n".join(
        f"{corpus[i]._.meta['us_cite_id']: <12} | {data.issue_area_codes[corpus[i]._.meta['issue_area']]: <18}"
        f" | {data.issue_codes[corpus[i]._.meta['issue']][:60]}"
        for i, _ in docs
    ))
    print("\n\n")
Cluster 0
---------
558 U.S. 310 | First Amendment | campaign spending (cf. governmental corruption):
559 U.S. 393 | Judicial Power | Federal Rules of Civil Procedure including Supreme Court Rul
559 U.S. 335 | Economic Activity | federal or state regulation of securities
559 U.S. 573 | Civil Rights | debtors' rights
560 U.S. 126 | Federalism | national supremacy: miscellaneous
560 U.S. 305 | Economic Activity | liability, other than as in sufficiency of evidence, electio
561 U.S. 1 | First Amendment | federal or state internal security legislation: Smith, Inter
561 U.S. 186 | Privacy | Freedom of Information Act and related federal or state stat
561 U.S. 247 | Economic Activity | federal or state regulation of securities
561 U.S. 661 | First Amendment | free exercise of religion
561 U.S. 742 | Criminal Procedure | miscellaneous criminal procedure (cf. due process, prisoners
Cluster 1
---------
558 U.S. 220 | Criminal Procedure | discovery and inspection (in the context of criminal litigat
559 U.S. 98 | Criminal Procedure | Miranda warnings
559 U.S. 766 | Criminal Procedure | habeas corpus
560 U.S. 258 | Criminal Procedure | Federal Rules of Criminal Procedure
560 U.S. 370 | Criminal Procedure | Miranda warnings
561 U.S. 358 | Criminal Procedure | statutory construction of criminal laws: fraud
561 U.S. 945 | Criminal Procedure | right to counsel (cf. indigents appointment of counsel or in
558 U.S. 139 | Criminal Procedure | cruel and unusual punishment, death penalty (cf. extra legal
clean
['assyrian',
'monarchs',
'especially',
'sardanapalus',
'babylon',
'scene',
'great',
'intellectual',
'activity',
'sardanapalus',
'assyrian',
'babylon',
'ized',
'library',
'library',
'paper',
'clay',
'tablets',
'writing',
'mesopotamia',
'early',
'sumerian',
'days',
'collection',
'unearthed',
'precious',
'store',
'historical',
'material',
'world',
'chaldean',
'line',
'babylonian',
'monarchs',
'nabonidus',
'keener',
'literary',
'tastes',
'patronized',
'antiquarian',
'researches',
'date',
'worked',
'investigators',
'accession',
'sargon',
'commemorated',
'fact',
'inscriptions',
'signs',
'disunion',
'empire',
'sought',
'centralize',
'bringing',
'number',
'local',
'gods',
'babylon',
'setting',
'temples',
'device',
'practised',
'successfully',
'romans',
'later',
'times',
'babylon',
'roused',
'jealousy',
'powerful',
'priesthood',
'bel',
'marduk',
'dominant',
'god',
'babylonians',
'cast',
'possible',
'alternative',
'nabonidus',
'found',
'cyrus',
'persian',
'ruler',
'adjacent',
'median',
'empire',
'cyrus',
'distinguished',
'conquering',
'croesus',
'rich',
'king',
'lydia',
'eastern',
'asia',
'minor',
'came',
'babylon',
'battle',
'outside',
'walls',
'gates',
'city',
'opened',
'soldiers',
'entered',
'city',
'fighting',
'crown',
'prince',
'belshazzar',
'son',
'nabonidus',
'feasting',
'bible',
'relates',
'hand',
'appeared',
'wrote',
'letters',
'fire',
'wall',
'mystical',
'words',
'mene',
'mene',
'tekel',
'upharsin',
'interpreted',
'prophet',
'daniel',
'summoned',
'read',
'riddle',
'god',
'numbered',
'thy',
'kingdom',
'finished',
'thou',
'art',
'weighed',
'balance',
'found',
'wanting',
'thy',
'kingdom',
'given',
'medes',
'persians',
'possibly',
'priests',
'bel',
'marduk',
'knew',
'writing',
'wall',
'belshazzar',
'killed',
'night',
'says',
'bible',
'nabonidus',
'taken',
'prisoner',
'occupation',
'city',
'peaceful',
'services',
'bel',
'marduk',
'continued',
'intermission']
Exercises#
Filter the tokens from the H. G. Wells text variable to 1) lowercase all text, 2) remove punctuation, 3) remove spaces and line breaks, 4) remove numbers, and 5) remove stopwords - all in one line! (See the sketch below.)
Read through the spaCy 101 guide and begin to apply its principles to your own corpus: https://spacy.io/usage/spacy-101
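A minimal sketch of one possible solution, assuming text holds the raw H. G. Wells passage and nlp is the loaded spaCy pipeline:
# One comprehension covers all five steps: lowercase each token, and drop
# punctuation, whitespace/line breaks, number-like tokens, and stopwords
clean = [tok.lower_ for tok in nlp(text) if not (tok.is_punct or tok.is_space or tok.like_num or tok.is_stop)]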
Topic modeling - going further#
There are many different approaches to modeling abstract topics in text data, such as top2vec and lda2vec.
Click ahead to see our coverage of the BERTopic algorithm in Chapter 10!