Chapter 9 - New Developments: Topic Modeling with BERTopic!#

2022 July 30


What is BERTopic?#

  • As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”

    • Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”

      • For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” share essentially no tokens, yet express a similar sentiment.

      • If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.

  • This is where BERTopic comes in! BERTopic is a cutting-edge technique that combines BERT-style transformer embeddings with other ML tools (dimensionality reduction, clustering, and a class-based TF-IDF topic representation) to provide a flexible and powerful topic modeling module, with great visualization support as well!

  • In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.

Required installs:#

# Installs the base bertopic module:
!pip install bertopic

# Using other transformer/language backends may require additional installs
# (the quotes keep zsh from interpreting the square brackets):
!pip install 'bertopic[flair]' # can substitute 'flair' with 'gensim', 'spacy', 'use'

# bertopic also comes with its own handy visualization suite:
!pip install 'bertopic[visualization]'

Data sourcing#

  • For this exercise, we’re going to use a popular dataset, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:

import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

print(documents[0]) # Any ice hockey fans? 
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Creating a BERTopic model:#

  • Using the BERTopic module requires you to create an instance of the model. When doing so, you can specify several parameters, including:

    • language -> the language of your documents

    • min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics

    • embedding_model -> the model used to compute document embeddings; many are supported!

Example instantiation:#

from sklearn.feature_extraction.text import CountVectorizer 

# example parameter: a custom vectorizer model can be used to remove stopwords from the documents: 
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english') 

# instantiating the model: 
model = BERTopic(vectorizer_model=stopwords_vectorizer)
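
  • The other parameters described above can be passed at instantiation in the same way. A minimal sketch (the model name and values below are illustrative, not recommendations):

# illustrative sketch: combining several of the parameters described above
custom_model = BERTopic(
    language="english",                    # language of the documents
    min_topic_size=20,                     # larger values yield fewer, broader topics
    embedding_model="all-MiniLM-L6-v2",    # any supported sentence-transformers model
    vectorizer_model=stopwords_vectorizer, # custom vectorizer for stopword removal
)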

Fitting the model:#

  • The first step of topic modeling is to fit the model to the documents. Note that fitting a corpus of this size (~18,000 documents) can take a substantial amount of time, especially without a GPU:

topics, probs = model.fit_transform(documents)
  • .fit_transform() returns two outputs:

    • topics contains the topic (i.e., cluster) assigned to each input document

    • probs contains the probability that each document belongs to its assigned topic

  • Note: fit_transform() can be substituted with fit(), which trains the model without returning topic assignments for the training documents. To assign topics to new, unseen documents after fitting, use transform(), as sketched below.
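
  • A minimal sketch of that prediction step, assuming the fitted model above (the example sentences are made up):

# assign topics to documents the model has never seen
new_docs = [
    "The goalie made an incredible save in overtime.",
    "My new graphics card finally arrived in the mail.",
]
new_topics, new_probs = model.transform(new_docs)
print(new_topics)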

Viewing topic modeling results:#

  • The BERTopic module has many built-in methods to view and analyze the topics of a fitted model. Here are some basics:

# view your topics: 
topics_info = model.get_topic_info()

# get detailed information about the top five most common topics: 
print(topics_info.head(5))
   Topic  Count                                     Name
0     -1   6674                -1_file_use_program_space
1      0   1783              0_game_games_players_season
2      1    612            1_clipper_encryption_chip_nsa
3      2    525  2_ken huh_ites yep_huh lets_forget ites
4      3    475        3_israel_israeli_jews_palestinian
  • When examining topic information, you may see a topic assigned the number -1. Topic -1 collects all outlier inputs that were not assigned to any topic, and it should typically be ignored during analysis.

  • Forcing outlier documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to let the model place such inputs in this ‘Topic -1’ bin.
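
  • If you want to exclude those outliers from downstream analysis, one possible approach is to filter on the assignments returned by fit_transform() (the DataFrame name here is just illustrative):

import pandas as pd

# pair each document with its assigned topic, then drop the -1 outlier bin
doc_topics = pd.DataFrame({"document": documents, "topic": topics})
non_outliers = doc_topics[doc_topics["topic"] != -1]
print(len(non_outliers), "of", len(documents), "documents were assigned a topic")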

# access a single topic: 
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('game', 0.008556967659211701), ('games', 0.006035198472503292), ('players', 0.005414535850522633), ('season', 0.005292677001045096), ('hockey', 0.0052528755903495), ('league', 0.004292414022656312), ('teams', 0.003992894205978727), ('baseball', 0.003759511650698762), ('nhl', 0.003528973766679468), ('gm', 0.0030145168667122632)]
# get representative documents for a specific topic: 
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics 
['Disclaimer -- This is for fun.\n\nIn my computerized baseball game, I keep track of a category called\n"stolen hits", defined as a play made that "an average fielder would not\nmake with average effort."  Using the 1992 Defensive Averages posted\nby Sherri Nichols (Thanks Sherri!), I\'ve figured out some defensive stats\nfor the centerfielders. Hits Stolen have been redefined as "Plays Juan\nGonzalez would not have made."\n\nOK, I realize that\'s unfair.  Juan\'s probably the victim of pitching staff,\nfluke shots, and a monster park factor.  But let\'s put it this way:  If we\nreplaced every centerfielder in the league with someone with Kevin\'s 55.4% out\nmaking ability, how many extra hits would go by?\n\nTo try and correlate it to reality a little more, I\'ve calculated Net\nHits Stolen, based on the number of outs made compared to what a league\naverage fielder would make.  By the same method I\'ve calculated Net Extra \nBases (doubles and triples let by).\n\nFinally, I throw all this into a a formula I call Defensive Contribution, or\nDCON :->.  Basically, it represents the defensive contribution of a player.\nI add this number to OPS to get DOPS (Defense + Onbase Plus Slug), which\nshould represent the player\'s total contribution to the team.  So don\'t\ntake it too seriously.  The formula for DCON appears at the end of this\narticle.\n\nThe short version -- definition of terms\nHS -- Hits Stolen -- Extra outs compared to Kurt Stillwell\nNHS -- Net Hits Stolen -- Extra outs compared to average fielder\nNDP -- Net Double Plays -- Extra double plays turned compared to avg fielder\nNEB -- Net Extra Bases --  Extra bases prevented compared to avg. fielder\nDCON -- Defensive Contribution -- bases and hits prevented, as a rate.\nDOPS -- DCON + OPS -- quick & dirty measure of player\'s total contribution.\n\nNational League\n\nName            HS   NHS   NEB   DCON    DOPS\nNixon, O.       34    12    15   .083    .777\nGrissom, M.     48    18    12   .072    .812\nJackson, D.     46    13    20   .060    .735\nLewis, D.       25     8    -6   .029    .596\nDykstra, L.     25     5    -5   .013    .794\nDascenzo, D.    10    -5    10   .001    .616\nFinley, S.      32    -2     2  -.003    .759\nLankford, R.    39     4   -12  -.007    .844\nMartinez, D.    21     5   -16  -.017    .660\nVanSlyke, A.    30    -4   -17  -.040    .846\nSanders, R.      7   -10    -4  -.059    .759\nButler, B.       1   -29     5  -.088    .716\nJohnson, H.      3   -12   -19  -.118    .548\n\nOrdered by DOPS\n\n.846 VanSlyke\n.844 Lankford\n.812 Grissom\n.794 Dykstra\n.777 Nixon\n.759 Finley\n.759 Sanders\n.735 Jackson\n.730 *NL Average*\n.716 Butler\n.660 Martinez\n.616 Dascenzo\n.596 Lewis\n.548 Johnson\n\nAmerican League\n---------------\n\nName            HS   NHS   NEB   DCON    DOPS\nLofton, K.      57    32    17   .220    .947\nWilson, W.      47    26     0   .125    .787\nWhite, D.       52    25    28   .119    .812\nFelix, J.       22     0    32   .063    .713\nDevereaux, M.   43    16     0   .047    .832\nMcRae, H.       38    11    -1   .038    .631\nYount, R.       31     8    -3   .022    .737\nKelly, R.       13    -6    -3  -.025    .681\nJohnson, L.     23    -5   -13  -.040    .641\nGriffey, K.     15    -9   -12  -.052    .844\nPuckett, K.     13   -13   -15  -.063    .801\nCuyler, M.       6   -10    -6  -.088    .503\nGonzalez, J.     
0   -21   -15  -.095    .738\n\n\nOrder by DOPS\n\n.947 Lofton\n.844 Griffey\n.832 Devereaux\n.812 White\n.801 Puckett\n.787 Wilson\n.738 Gonzalez\n.737 Yount\n.713 Felix\n.709 *AL Average*\n.681 Kelly\n.641 Johnson\n.631 McRae\n.503 Cuyler\n\nMore discussion --\n\nDCON formula:  ((NHS + NDP)/PA) + ((NHS + NDP + NEB)/AB)\nWhy such a bizzare formula?  Basically, it\'s designed to be added into the\nOPS, with the idea that "a run prevented is as important as a run scored".\nThe extra outs are factored into OBP, while the extra bases removed are \nfactored into SLG.  That\'s why I used PA and AB as the divisors.\n\nFor more discussion see the post on Hits Stolen -- First Base 1992\n-- \nDale J. Stephenson |*| (steph@cs.uiuc.edu) |*| Baseball fanatic', '\n\n\n\n\n\n\nIs the answer as simple as that you dislike russians???\n\n\n\n\nAnd where would canadian hockey be today without the europeans?? Dont say\nthat the european influence on the league has been all bad for the game.\nI mean, look at the way you play these days. Less fights and more hockey.\nImho, canadian hockey has had a positive curve of development since the\n70\'s when the game was more brute than beauty......\n\n\nOh, look!! You don\'t like Finns either....\n\nToo bad almost all of you northamericans originates from europe.....\n\nHmmm... And what kind of a name is Rauser. Doesn\'t sound very "canadian" to\nme. ;-)', "\tAh, so now we're into European player bashing?  What next?  \nNo more French Canadiens?  Yeah, there's an idea!  Let them French-\nspeaking Canadiens have their own hockey league!  We don't want them!\n\tAre you _CRAZY_?  The NHL is one of the true international\nleagues, and yes, there _ARE_ many Europeans who deserve to play in\nthe NHL and are better than some North Americans, look at Teemu!!!\nI, for one, am glad to see Europeans in the NHL and I hope the\nNHL soon expands to Europe.  Its nice to see all these different\npeople come together to form the (soon to be) 26 hockey teams.\n\t\n\nDarryl Brooks                    University at Buffalo\n                __                 ______                        ///\n       | |     /  \\  \\ \\     / /  / _____          / /         ////\n       | |    / /\\ \\  \\ \\___/ /  (  \\          ---/-/---       ///\n       | |   / /__\\ \\   \\   /      \\  \\       ---/-/---       ///\n \\______/  / /      \\ \\  | |     ______/                  ///////"]
# find topics similar to a key term/phrase:
# (a new variable name avoids overwriting the fit_transform() assignments above)
similar_topics, similarity_scores = model.find_topics("sports", top_n=5)
print("Most common topics:" + str(similar_topics)) # view the numbers of the top-5 most similar topics

# print the initial contents of the most similar topics
for topic_num in similar_topics:
    print('\nContents from topic number: ' + str(topic_num) + '\n')
    print(model.get_topic(topic_num))
    
Most common topics:[0, 7, 24, 180, 111]

Contents from topic number: 0

[('game', 0.008556967659211701), ('games', 0.006035198472503292), ('players', 0.005414535850522633), ('season', 0.005292677001045096), ('hockey', 0.0052528755903495), ('league', 0.004292414022656312), ('teams', 0.003992894205978727), ('baseball', 0.003759511650698762), ('nhl', 0.003528973766679468), ('gm', 0.0030145168667122632)]

Contents from topic number: 7

[('bike', 0.01673826325614919), ('riding', 0.012539703475904924), ('ride', 0.011598173417616448), ('driving', 0.007891274351624324), ('traffic', 0.007206925821355247), ('road', 0.006767514327944913), ('bikes', 0.0047496125389362084), ('riders', 0.004659732250273723), ('speed', 0.00446265563654249), ('roads', 0.0040894899998972405)]

Contents from topic number: 24

[('games', 0.031005009699471373), ('joystick', 0.023516729868248754), ('sega', 0.022659088021281183), ('arcade', 0.011657116816441129), ('snes', 0.01042610977151653), ('sega genesis', 0.010368839296791447), ('joysticks', 0.00986371422929634), ('games sale', 0.009666588744639134), ('sale', 0.00918498185236209), ('sega cd', 0.007095778186846751)]

Contents from topic number: 180

[('religion', 0.033218459176287435), ('war', 0.023397937242953797), ('crusades', 0.018002896196492476), ('wars', 0.017593432338661088), ('killing religiously', 0.014426582063284546), ('religious', 0.013200118649577486), ('statement religion', 0.01052722528644783), ('ireland', 0.010356135704588581), ('northern ireland', 0.009617981061839245), ('catholics killed', 0.007819453848047997)]

Contents from topic number: 111

[('life', 0.017248267878068352), ('christianity', 0.014784747868165438), ('christian', 0.011263045735546998), ('kendigianism', 0.008245592976949124), ('christians', 0.008196511949605393), ('amusement park', 0.0070310625051416935), ('bible', 0.0065473302157617345), ('santa claus', 0.0060120062498362084), ('reindeer', 0.005781249899900431), ('religion', 0.005546596003166386)]

Saving/loading models:#

  • One of the most obvious drawbacks of the BERTopic technique is its run-time. Rather than re-running a script every time you want to conduct topic modeling analysis, you can simply save and load fitted models!

# save your model: 
# model.save("TAML_ex_model")
# load it later: 
# loaded_model = BERTopic.load("TAML_ex_model")

Visualizing topics:#

  • Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.

  • Depending on the visualization, it can even reveal patterns that would be much harder or impossible to see through textual analysis, such as inter-topic distance!

  • Let’s see some examples!

# Create a 2D representation of your modeled topics & their pairwise distances: 
model.visualize_topics()
# Get the words and probabilities of top topics, but in bar chart form! 
model.visualize_barchart()
# Evaluate topic similarity through a heat map: 
model.visualize_heatmap()
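
  • These visualize_* methods return Plotly figures, so one way to share results is to write a figure to a standalone HTML file; a small sketch (the filename is arbitrary):

# save an interactive visualization to a standalone HTML file
fig = model.visualize_topics()
fig.write_html("topic_visualization.html")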

Conclusion#

Exercise#

  1. Repeat the steps in this notebook with your own data. However, real-world data does not come with a fetch function. What import steps do you need to consider so that your own corpus works?
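
As a starting point, one possible way to load a corpus into the list-of-strings format BERTopic expects is sketched below (the file name and column name are hypothetical):

import pandas as pd
from bertopic import BERTopic

# hypothetical CSV with one document per row in a 'text' column
corpus_df = pd.read_csv("my_corpus.csv")

# BERTopic expects a list of strings, one per document
my_documents = corpus_df["text"].dropna().astype(str).tolist()

# my_topics, my_probs = BERTopic().fit_transform(my_documents)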