Chapter 4.5 - New Developments: Topic Modeling with BERTopic!#

2022 July 30

bertopic

What is BERTopic?#

  • As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”

    • Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”

      • For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” contain wholly unique tokens, but encode a similar sentiment.

      • If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.

  • This is where BERTopic comes in! BERTopic is a cutting-edge methodology that leverages the transformer embeddings underlying BERT, together with other ML tools such as UMAP and HDBSCAN, to provide a flexible and powerful topic modeling module (with great visualization support as well!).

  • In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
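
As a quick illustration of the point above about shared meaning without shared words, compare the token overlap of the two example sentences (a minimal sketch using plain Python sets):

```python
# The two example sentences encode a similar sentiment...
sentence_a = "i enjoy learning to code"
sentence_b = "educating myself on new computer programming techniques makes me happy"

# ...yet share no tokens at all, so word/phrase counts alone can't link them:
overlap = set(sentence_a.split()) & set(sentence_b.split())
print(overlap)  # set()
```

This is exactly the gap that embedding-based topic modeling is meant to close.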

Required installs:#

# Installs the base bertopic module:
!pip install bertopic 

# If you want to use other transformers/language backends, additional installs may be required
# (quote the extras so that zsh does not try to expand the square brackets): 
!pip install 'bertopic[flair]' # can substitute 'flair' with 'gensim', 'spacy', or 'use'

# bertopic also comes with its own handy visualization suite: 
!pip install 'bertopic[visualization]'

Data sourcing#

  • For this exercise, we’re going to use a popular dataset, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

print(documents[0]) # Any ice hockey fans? 
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Creating a BERTopic model:#

  • Using the BERTopic module requires you to instantiate the model. When doing so, you can specify several different parameters, including:

    • language -> the language of your documents

    • min_topic_size -> the minimum size of a topic; increasing this value will lead to a lower number of topics

    • embedding_model -> the model used to generate your document embeddings; many are supported!

Example instantiation:#

from sklearn.feature_extraction.text import CountVectorizer 

# example parameter: a custom vectorizer model can be used to remove stopwords from the documents: 
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english') 

# instantiating the model: 
model = BERTopic(vectorizer_model=stopwords_vectorizer)

Fitting the model:#

  • The first step of topic modeling is to fit the model to the documents:

topics, probs = model.fit_transform(documents)

  • .fit_transform() returns two outputs:

    • topics contains mappings of inputs (documents) to their modeled topic (alternatively, cluster)

    • probs contains a list of probabilities that each input belongs to its assigned topic

  • Note: fit_transform() can be substituted with fit(). fit() only trains the model, while fit_transform() also returns the topic assignments for the input documents (and the fitted model can then predict topics for new documents via .transform()), at the cost of additional computing power/time.
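
To get a feel for the topics output, you can tally how many documents were assigned to each topic. A minimal sketch with a small hypothetical topics list standing in for the real fit_transform() output:

```python
from collections import Counter

# Hypothetical topic assignments, standing in for the output of fit_transform():
topics = [0, 0, 1, -1, 2, 0, -1]

# Count how many documents landed in each topic (-1 marks outliers):
topic_counts = Counter(topics)
print(topic_counts.most_common(2))  # [(0, 3), (-1, 2)]
```

These counts correspond to the Count column you will see in get_topic_info() below.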

Viewing topic modeling results:#

  • The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:

# view your topics: 
topics_info = model.get_topic_info()

# get detailed information about the top five most common topics: 
print(topics_info.head(5))
   Topic  Count                                       Name
0     -1   6198                    -1_like_use_know_people
1      0   1836                  0_game_team_games_players
2      1    572              1_key_clipper_chip_encryption
3      2    527  2_whatta ass_ken huh_forget ites_ites yep
4      3    458               3_monitor_card_video_drivers
  • When examining topic information, you may see a topic numbered ‘-1.’ Topic -1 collects the outlier inputs that were not assigned to any topic, and it should typically be ignored during analysis.

  • Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
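
If you do want to set Topic -1 aside before downstream analysis, a simple filter works. A sketch using small hypothetical documents and topics lists in place of the real fit_transform() outputs:

```python
# Hypothetical inputs and assignments, standing in for fit_transform() results:
documents = ["doc A", "doc B", "doc C", "doc D"]
topics = [-1, 0, 1, -1]

# Keep only the documents that were assigned a real topic:
assigned = [(doc, t) for doc, t in zip(documents, topics) if t != -1]
print(assigned)  # [('doc B', 0), ('doc C', 1)]
```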

# access a single topic: 
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('game', 0.009178662518274271), ('team', 0.007968780811213744), ('games', 0.006334453560836352), ('players', 0.005535902899143252), ('season', 0.00547486906998158), ('hockey', 0.005363234415036895), ('play', 0.005086475485566249), ('year', 0.005000281340771309), ('25', 0.004999415937511647), ('league', 0.004376256337038951)]
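
Since .get_topic() returns (word, weight) pairs, you can pull out just the words to build a quick human-readable label, mirroring the Name column of get_topic_info(). A sketch with a truncated hypothetical topic list:

```python
# A few (word, weight) pairs, as returned by model.get_topic(0):
topic_words = [("game", 0.0092), ("team", 0.0080), ("games", 0.0063)]

# Join the top words into a compact label:
label = "_".join(word for word, weight in topic_words)
print(label)  # game_team_games
```
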
# get representative documents for a specific topic: 
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics 
["\n\n\n Hmmm...what about walks and SB? Baerga got clobbered by Alomar in OBP and\nbeat him in SLG by a lesser margin. Even putting aside any other factors,\na player with a 51 point edge in OBP is more productive than a player with\na 28 point edge in SLG. The issue has been studied before, and I doubt you\ncould come up with any convincing argument the other way.\n People see the batting average and the HR, but they don't really know  \ntheir value is worth unless they've studied the issue closely. The fact is that\nBaerga ate up a LOT more outs than Alomar; while Baerga was making outs,\nAlomar was drawing walks and being on base for Carter, Winfield et.al.", '\n\n\n\n\n\n\nIs the answer as simple as that you dislike russians???\n\n\n\n\nAnd where would canadian hockey be today without the europeans?? Dont say\nthat the european influence on the league has been all bad for the game.\nI mean, look at the way you play these days. Less fights and more hockey.\nImho, canadian hockey has had a positive curve of development since the\n70\'s when the game was more brute than beauty......\n\n\nOh, look!! You don\'t like Finns either....\n\nToo bad almost all of you northamericans originates from europe.....\n\nHmmm... And what kind of a name is Rauser. Doesn\'t sound very "canadian" to\nme. ;-)', 'Wow, this guy seems to be out to prove something to his old team, Boston.\nWhich Sweeney you ask...well, of course Bob Sweeney, the one that Boston\nlet Buffalo get a hold of (they still have 2 Sweeneys which makes things\nslightly confusing).  Game winner in OT in game 1, and another\nBIG goal (seconds after Fuhr made 3 point blank saves -> this is why\nGrant has 5 rings!!!) to put Buffalo ahead in the 3rd.  
Yes, Neely countered\na minute later, but hadn\'t this course of Buffalo going ahead after being\ntied and shutting down another few great scoring opportunities, I\nthink Boston would have notched their first win of the series.\n\nWell, the Sabres haven\'t made it to the end of this series yet, but\nI certainly feel they\'ve got Boston right were they want them...actually,\nthey\'ve got them in a position that neither Buffalo nor Boston felt\nthat would come about.  One more astronomical game by Fuhr, a few more\nheroics by the rest of the team (this is a team sport afterall) and I\nthink Borque, Neely, Jouneau (sp?), and Company are gonna be swinging\na new stick (Weather is perfect for golf season) real soon.  I\'m not\ngonna waiger anything on this, because I\'ve seen some really strange\nthings happen in both pro and college hockey.\n\nTalking about golf...was that a hockey swing, golf swing or baseball\nswing that Hawerchuck used in the last shot of the game that Khmylev\ndeflected in for the BIG ONE?  The whole OT (all 1 minute of it!) was a\ntesiment to Buffalo\'s ability to really be persistent and grind it out\nin the end (something they weren\'t necessarily in the regular season).  The\nSabres pushed hard and forced Borque to blatently take down Bodger in\nthe opening seconds.  I don\'t normally like penalties being called in\nsuch ultra-critical points, but this was BLATENT.  Finally, the Sabres\nwon a faceoff (they weren\'t that hot in this dept the rest of the game)\nwhen LaFontaine scooped at the puck 3 times.  When Hawerchuck took his\nshot (quite a boomer, but Blue stopped this one) he took a few steps\nover to get his own rebound and slapped at it again, without setting\nit up.  I didn\'t realize it went in until the announcer started screaming,\n"They score, THEY SCORE!!!".  The best was seeing LaFontaine jumping\nup and down, skating a little bit, jumping some more, and then skating\nover to Brad May who he jumped on.']
# find topics similar to a key term/phrase: 
topics, similarity_scores = model.find_topics("sports", top_n = 5)
print("Most common topics:" + str(topics)) # view the numbers of the top-5 most similar topics

# print the initial contents of the most similar topics
for topic_num in topics: 
    print('\nContents from topic number: '+ str(topic_num) + '\n')
    print(model.get_topic(topic_num))
    
Most common topics:[0, 30, 117, 7, 65]

Contents from topic number: 0

[('game', 0.009178662518274271), ('team', 0.007968780811213744), ('games', 0.006334453560836352), ('players', 0.005535902899143252), ('season', 0.00547486906998158), ('hockey', 0.005363234415036895), ('play', 0.005086475485566249), ('year', 0.005000281340771309), ('25', 0.004999415937511647), ('league', 0.004376256337038951)]

Contents from topic number: 30

[('games', 0.03189057739983477), ('joystick', 0.023109556177641377), ('sega', 0.022254558047225093), ('game', 0.015150510268174322), ('cd', 0.01350687307405457), ('genesis', 0.011652039222125864), ('arcade', 0.011430880237284939), ('525 10', 0.011084315177448975), ('525', 0.01093032209397142), ('super', 0.01058430464552049)]

Contents from topic number: 117

[('helmet', 0.13130258501553513), ('liner', 0.03589565361558216), ('foam', 0.028549939482073516), ('cb', 0.02787288775568256), ('helmets', 0.019564385096043076), ('impact', 0.019275603888304765), ('bike', 0.018408318218014745), ('shoei', 0.017647319577022695), ('head', 0.01696129848156763), ('mirror', 0.01609430856704741)]

Contents from topic number: 7

[('bike', 0.01588738945736458), ('riding', 0.012032185233341505), ('car', 0.010761539949614264), ('ride', 0.010499327438537967), ('lane', 0.00898723039814874), ('driving', 0.007689840869501699), ('passenger', 0.007401980394887971), ('traffic', 0.006917656206067343), ('road', 0.006784420468337043), ('cop', 0.006577432225108643)]

Contents from topic number: 65

[('religion', 0.026246107803648847), ('schools', 0.017767044728977974), ('moment silence', 0.01631456871311663), ('public schools', 0.014647396741100374), ('moment', 0.014328639074325519), ('silence', 0.014009920008202965), ('cult', 0.013556651905168597), ('christian', 0.013547374369023344), ('prayer', 0.0123850805567994), ('religious', 0.012311105970830167)]

Saving/loading models:#

  • One of the most obvious drawbacks of the BERTopic technique is the algorithm’s run-time. Rather than re-running a script every time you want to conduct topic modeling analysis, you can simply save and load models!

# save your model: 
model.save("TAML_ex_model")
# load it later: 
loaded_model = BERTopic.load("TAML_ex_model")

Visualizing topics:#

  • Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.

  • Depending on the visualization, it can even reveal patterns that would be much harder (or impossible) to spot through textual analysis alone, such as inter-topic distance!

  • Let’s see some examples!

# Create a 2D representation of your modeled topics & their pairwise distances: 
model.visualize_topics()

# Get the words and probabilities of top topics, but in bar chart form! 
model.visualize_barchart()

# Evaluate topic similarity through a heat map: 
model.visualize_heatmap()

Conclusion#