Chapter 9 - New Developments: Topic Modeling with BERTopic!#

2022 July 30

What is BERTopic?#

  • As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”

    • Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”

      • For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” share almost no tokens, yet express essentially the same idea (the sketch after this list makes this concrete).

      • If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.

  • This is where BERTopic comes in! BERTopic is a topic modeling technique that leverages transformer-based document embeddings (the BERT family and its relatives) along with other ML tools, such as dimensionality reduction and clustering, to provide a flexible and powerful topic modeling module (with great visualization support as well!).

  • In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
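
To make the token-overlap point concrete, here is a small illustrative sketch using the sentence-transformers package (a dependency of bertopic); the 'all-MiniLM-L6-v2' model is just one reasonable choice, not the only option:

# The two example sentences above share almost no tokens, yet their 
# sentence embeddings land close together in vector space: 
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2') # example model (assumption)
embeddings = embedder.encode([
    "I enjoy learning to code.",
    "Educating myself on new computer programming techniques makes me happy!",
])
print(util.cos_sim(embeddings[0], embeddings[1])) # high cosine similarity despite no shared tokens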

Required installs:#

# Installs the base bertopic module:
# !pip install bertopic 

# If you want to use other transformers/language backends, it may require additional installs: 
# !pip install bertopic[flair] # can substitute 'flair' with 'gensim', 'spacy', 'use'

# bertopic also comes with its own handy visualization suite: 
# !pip install bertopic[visualization]

Data sourcing#

  • For this exercise, we’re going to use a popular dataset, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:

import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

print(documents[0]) # Any ice hockey fans? 
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Creating a BERTopic model:#

  • Using the BERTopic module requires you to create an instance of the model. When doing so, you can specify several parameters, including:

    • language -> the language of your documents

    • min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics

    • embedding_model -> the model used to compute your document embeddings; many are supported!

Example instantiation:#

from sklearn.feature_extraction.text import CountVectorizer 

# example parameter: a custom vectorizer model can be used to remove stopwords from the documents: 
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english') 

# instantiating the model: 
model = BERTopic(vectorizer_model=stopwords_vectorizer)
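
To exercise the parameters listed above, a more customized instantiation might look like the hedged sketch below; 'all-MiniLM-L6-v2' is only an example embedding model name, and the min_topic_size value is arbitrary:

# a sketch combining the parameters discussed above: 
custom_model = BERTopic(
    language="english",                    # language of the documents
    min_topic_size=20,                     # larger values -> fewer topics
    embedding_model="all-MiniLM-L6-v2",    # example embedding model (assumption)
    vectorizer_model=stopwords_vectorizer, # custom vectorizer from above
)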

Fitting the model:#

  • The first step of topic modeling is to fit the model to the documents:

topics, probs = model.fit_transform(documents)
  • .fit_transform() returns two outputs:

    • topics contains mappings of inputs (documents) to their modeled topic (alternatively, cluster)

    • probs contains, for each input, the probability that it belongs to its assigned topic

  • Note: fit_transform() can be substituted with fit(), which only trains the model and skips returning the assignments for the training documents. To assign topics to new, unseen documents after fitting, use transform(), as sketched below.
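
For instance, once the model has been fit, unseen documents can be assigned to the learned topics with .transform(); a minimal sketch (the example strings are invented):

# assign topics to new, unseen documents using the fitted model: 
new_docs = ["The goalie made an incredible save in overtime.",  # hypothetical
            "The new encryption chip raised privacy concerns."] # hypothetical
new_topics, new_probs = model.transform(new_docs)
print(new_topics) # topic numbers assigned to each new document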

Viewing topic modeling results:#

  • The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:

# view your topics: 
topics_info = model.get_topic_info()

# get detailed information about the top five most common topics: 
print(topics_info.head(5))
   Topic  Count                                       Name
0     -1   6440                -1_file_use_new_information
1      0   1824                0_team_games_players_season
2      1    560              1_clipper_chip_encryption_nsa
3      2    526  2_dancing idjits_nate ites_ken huh_idjits
4      3    482          3_israel_israeli_jews_palestinian
  • When examining topic information, you may see a topic with the assigned number -1. Topic -1 collects all outlier documents that were not assigned to any topic, and it should typically be ignored during analysis.

  • Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
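
As a quick check, you can count how many documents fell into the outlier bin and exclude them before downstream analysis; a minimal sketch using the topics list returned by fit_transform():

# count documents assigned to the outlier bin (topic -1): 
n_outliers = sum(1 for t in topics if t == -1)
print(f"{n_outliers} of {len(topics)} documents were left unassigned")

# keep only documents with an assigned topic: 
assigned_docs = [(doc, t) for doc, t in zip(documents, topics) if t != -1]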

# access a single topic: 
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('team', 0.007654185388395912), ('games', 0.006113919016048703), ('players', 0.005424915037914894), ('season', 0.005345216974048173), ('hockey', 0.005230859348275039), ('league', 0.004282216039504247), ('teams', 0.003992926231473371), ('baseball', 0.003806963864200249), ('nhl', 0.0035165100537772457), ('gm', 0.0029919572953508375)]
# get representative documents for a specific topic: 
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics 
['I am selling a one way ticket from Washington DC to Champaign, IL ( the\nhome of the University of Illinois).  Am willing to offer a good price.\n\nIf you are interested, please email me at:  eshneken@uiuc.edu', "Well, it's not that bad. But I am still pretty pissed of at the\nlocal ABC coverage. They cut off the first half hour of coverage by playing\nDavid Brinkley at 12:30 instead of an earlier time slot. I don't\neven understand their problem. If they didnt think enough people would\n\nnot watch the game why would they decide to show most of the game? And\nif they showed the remaining 2.5 hours of the game, would it hurt to play\nDavid Brinkley at its regular time? They dont have any decent programming\nbefore noon anyway. I called the sports dept and blasted them on their\nmachine. I called gain and someone picked it up. When I asked him why they\npremepted the first half hour of the Stanley Cup playoffs, he seemed a bit\nconfused. When I explained a bit more in detail, he then said that's upto\nto our programming dept. call back on  Monday. weel, I understand that the\nsports dept is not responsible for this preemption. BUt I can't understand\nhow someone in the sports dept. can't even recognise the name of playoffs\nshown on the very same station he works for.\n\nAnyway, I am going to call them tomorrow and blast them on the phone again.\nI urge all Atlanta hockey fans to call WSB 2 and ask them not to do the\nsame thing for the next 4 weeks.", '\n\n\nHank Greenberg would have to be the most famous, because his Jewish\nfaith actually affected his play. (missing late season or was it world\nseries games because of Yom Kippur)\n\n\n']
# find topics similar to a key term/phrase: 
topics, similarity_scores = model.find_topics("sports", top_n = 5)
print("Most common topics:" + str(topics)) # view the numbers of the top-5 most similar topics

# print the initial contents of the most similar topics
for topic_num in topics: 
    print('\nContents from topic number: '+ str(topic_num) + '\n')
    print(model.get_topic(topic_num))
    
Most common topics:[0, 30, 95, 117, 193]

Contents from topic number: 0

[('team', 0.007654185388395912), ('games', 0.006113919016048703), ('players', 0.005424915037914894), ('season', 0.005345216974048173), ('hockey', 0.005230859348275039), ('league', 0.004282216039504247), ('teams', 0.003992926231473371), ('baseball', 0.003806963864200249), ('nhl', 0.0035165100537772457), ('gm', 0.0029919572953508375)]

Contents from topic number: 30

[('games', 0.031446897729878846), ('sega', 0.022570308811437154), ('joystick', 0.018742853517673526), ('arcade', 0.011605502363103037), ('joysticks', 0.010712771412095111), ('snes', 0.010381522669573035), ('sega genesis', 0.01032054648812185), ('games sale', 0.009620488152877653), ('sale', 0.009189808730603096), ('sega cd', 0.007060497762869501)]

Contents from topic number: 95

[('countersteering', 0.051451643138064156), ('motorcycle', 0.02136572878983802), ('technique', 0.0202215731625135), ('riding', 0.01945494045982923), ('steering', 0.015516708246742955), ('handlebars', 0.012921754681028613), ('riders', 0.011339415397041431), ('wheels', 0.011005113485526831), ('like motorcycle', 0.009162645964962236), ('gyroscopes', 0.009162645964962236)]

Contents from topic number: 117

[('helmet', 0.12958294535687548), ('cb', 0.027600226097084818), ('helmets', 0.01938441508053053), ('leave helmet', 0.015231816373511889), ('helmet mirror', 0.012522910265258222), ('helmet seat', 0.012522910265258222), ('weight helmet', 0.00971847616486251), ('foam liner', 0.00971847616486251), ('place helmet', 0.00971847616486251), ('fit', 0.009067344841681743)]

Contents from topic number: 193

[('life', 0.019405145972759955), ('kendigianism', 0.018878088625155012), ('christianity', 0.011828756004468866), ('good life', 0.011616737516937304), ('drug', 0.01010323851149326), ('christianity drug', 0.00992501927235557), ('christian', 0.009441830038657032), ('bible', 0.009290144891533235), ('religion', 0.008943159888367086), ('god says', 0.008423505526655968)]

Saving/loading models:#

  • One of the most obvious drawbacks of the BERTopic technique is its run-time. Rather than re-fitting a model every time you want to conduct topic modeling analysis, you can simply save and load fitted models!

# save your model: 
# model.save("TAML_ex_model")
# load it later: 
# loaded_model = BERTopic.load("TAML_ex_model")

Visualizing topics:#

  • Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.

  • Depending on the visualization, it can even reveal patterns that would be much harder/impossible to see through textual analysis - like inter-topic distance!

  • Let’s see some examples!

# Create a 2D representation of your modeled topics & their pairwise distances: 
model.visualize_topics()
# Get the words and probabilities of top topics, but in bar chart form! 
model.visualize_barchart()
# Evaluate topic similarity through a heat map: 
model.visualize_heatmap()
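
Each of these methods returns a Plotly figure, so a chart can also be saved for sharing outside the notebook, e.g. as standalone HTML (the file name below is just an illustration):

# save a visualization as a standalone HTML file: 
fig = model.visualize_topics()
fig.write_html("topic_distances.html") # hypothetical output path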

Conclusion#
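
  • In this chapter, we created a BERTopic model, fit it to the 20 Newsgroups corpus, explored the fitted topics with BERTopic’s built-in methods, visualized them, and saw how saving and loading models helps avoid costly re-runs.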

Exercise#

  1. Repeat the steps in this notebook with your own data. Real data, however, does not come with a fetch function. What import steps do you need to consider so that your own corpus works? One starting point is sketched below.
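
As a starting point, here is a minimal sketch of one common pattern: reading a folder of plain-text files into a list of strings. The directory name and encoding are assumptions, and your corpus might instead live in a CSV file, a database, or behind an API:

# a hedged sketch for loading your own corpus from a folder of .txt files: 
from pathlib import Path

corpus_dir = Path("my_corpus") # hypothetical folder containing .txt files
documents = [path.read_text(encoding="utf-8") # watch out for encoding errors
             for path in sorted(corpus_dir.glob("*.txt"))]
print(f"Loaded {len(documents)} documents")

# other things to consider: stripping headers/footers and boilerplate, 
# de-duplicating texts, and ensuring each list element is one document.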