When I first heard about topic modeling, I joked that I would use the method to prepare for my candidacy exams. Without thinking about it, I ended up doing just that. Sort of. I’m not avoiding reading items on my exam lists, but I am finding specific items worth dedicating time to even if they did not make it onto my lists. Without too much effort or time, I generated a website that allows users to view issues of the Tatler, an eighteenth-century periodical, with cross-references between texts that share common vocabularies. This post describes my methods and the final payoff: a time-saving device for isolating and describing features of texts I know only a little about, so that I can make a reasonably educated guess as to which items will be worth my time reading.
I’ve found myself in a slight time crunch preparing for my candidacy exams. Specifically, I’m finding that it would be helpful to be more familiar with a number of the pieces left off my lists of texts. The Spectator, the eighteenth-century periodical put out by Joseph Addison and Richard Steele from 1711 to 1712, made it onto my lists, but Steele’s earlier project, the Tatler, was cut during the list-editing process. The two periodicals are quite important. As Erin Mackie puts it in Market à la Mode: Fashion, Commodity, and Gender in the Tatler and the Spectator, they serve as “a complete documentation of life in early-eighteenth-century England” (Mackie xv). I wanted a better understanding of the shift (if any) from the Tatler to the Spectator, the latter of which I’ve read, but I simply could not justify investing time in the former when it wasn’t immediately on my exams. To get a better grasp of it, I decided to topic model the first two hundred issues of the Tatler. Originally, I had two motivations: first, the topic model would give me some idea of the latent topics or discourses within the Tatler as a whole; second, I’d be able to see which issues were composed of topics immediately pertinent to my dissertation project, and, as a result, I’d be able to devote time to those issues while expecting a reasonable payoff.
The Tatler is well suited for topic modeling. It’s a compact corpus, already “chunked” into separate issues, and freely available over the web. Furthermore, I wasn’t going into this blind; I had some idea of its content from reading secondary texts. Organizing the output based on topic frequency (I opened the topic-keys file in Excel and sorted by the second column), I was able to get the gist of which themes were prevalent in the corpus. The most prevalent topic (98: great place found made company told day account discourse end) did not tell me much, but in combination with the next three most common topics I can see there seems to be a discussion of the difference between country and city, as well as the role of the individual.
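If you’d rather skip Excel, the same sort can be done in a few lines of Python. This is only a sketch: the three sample rows below are invented for illustration, but they follow the layout MALLET uses for its topic-keys output (tab-separated topic id, weight, then the top words).

```python
# Invented sample rows in MALLET topic-keys layout:
# topic id <tab> weight <tab> top words
sample = (
    "98\t0.41\tgreat place found made company told day account discourse end\n"
    "12\t0.07\tbeef diet eat toast mutton dishes\n"
    "50\t0.19\tfashion dress town ladies\n"
)

# Parse each line into [id, weight, words]
rows = [line.split('\t') for line in sample.splitlines()]

# Sort by the second column (weight), most prevalent topic first --
# the same move as sorting the second column in Excel
rows.sort(key=lambda r: float(r[1]), reverse=True)

for topic_id, weight, words in rows:
    print(topic_id, weight, words)
```

Swapping the inline sample for the real topic_keys.txt file just means reading the lines from disk instead.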
This is, I think, one possible way in which topic modeling can be useful. It acts like a focused search engine for documents. While you still need some idea of what you’re searching for, the same is true of most research situations. Furthermore, it gives a rough idea of the entirety of the corpus in addition to the ability to isolate particular documents that may be pertinent. I admit this may not be worth the trouble for these early, limited quantities of the Tatler, but if you were working on a larger corpus, say the introductory text from a few thousand cookbooks or a significant number of diary entries, I think the process would be helpful, save some time, and allow one to scratch a research itch without submitting to the black hole of research.
All that said, I don’t like topic modeling. I don’t feel I have a good enough grasp of the mathematics behind it (nor the time to devote to understanding it), so I’m hesitant to make any claims about the accuracy of its results. Nevertheless, by reading through 1,000 words organized by topic, I could get a rough idea of what was happening in the texts, and, by focusing on a few topics in particular, I was able to isolate a few issues worth reading. Topic 89, which included the words beef, diet, eat, toast, mutton, and dishes, seemed particularly relevant to my interest in food, suggesting that Tatler “No. 148” may be worthwhile. Similarly, topics 50 (fashion), 56 (table, tables, swift), and 68 (wine, liquors, taste, art) seemed relevant. From there, I was able to narrow things down to these issues: 67, 74, 77, 78, 81, 84, 86, 98, 113, 131, 142, 147, and 148. This cut my reading from 193 issues to 13. That’s doable. That’s easily doable.
To begin, I grabbed available issues of the Tatler off Project Gutenberg. While these were already in plain text, I did have to do some minor cleanup: I removed all the copyright information in the header and footer of each volume, combined the remaining text into a single file, removed all the footnotes, and did a search for “No. ” to find the beginning of each issue (as they’re numbered “No. 1,” “No. 2,” etc.). Using Sublime Text’s multiple cursors feature, I was able to insert “&&&&&&” into the file right before each issue, in order to facilitate breaking it up later via the Python script below:
import os

# Read the combined Tatler text, lowercase it, and strip apostrophes
with open('tat.txt') as f:
    text = f.read().lower().replace("'", "")

# Split on the "&&&&&&" delimiter inserted before each issue and
# write each chunk to its own numbered file
os.makedirs('rawr/tat.all', exist_ok=True)
for x, issue in enumerate(text.split('&&&&&&')):
    with open('rawr/tat.all/tat' + str(x) + '.txt', 'w') as out:
        out.write(issue)
Once everything was broken into individual files, I began the actual process of topic modeling. I used MALLET, an open-source tool for text analysis. After importing the documents,
bin/mallet import-dir --input /Users/jessemenn/Desktop/hastac/tatler/rawr/tat.all --output corpus.mallet --keep-sequence --remove-stopwords
I ran the actual command to perform the topic modeling:
bin/mallet train-topics --input corpus.mallet --num-topics 100 --optimize-interval 10 --output-state /Users/jessemenn/Desktop/hastac/tatler/rawr/topic_state.gz --output-topic-keys /Users/jessemenn/Desktop/hastac/tatler/rawr/topic_keys.txt --output-doc-topics /Users/jessemenn/Desktop/hastac/tatler/rawr/doc_topics.txt
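Once doc_topics.txt exists, picking out issues where a topic of interest carries real weight can also be scripted. This is a hedged sketch: the two sample rows are invented, and they assume the older MALLET doc-topics layout (doc index, source name, then topic/proportion pairs); newer releases format this file differently, so check your own output first.

```python
# Invented sample rows, assuming the older MALLET doc-topics layout:
# doc index <tab> source name <tab> topic <tab> proportion <tab> ...
sample = (
    "0\ttat148.txt\t89\t0.31\t50\t0.12\n"
    "1\ttat2.txt\t98\t0.40\t3\t0.08\n"
)

TOPIC_OF_INTEREST = 89  # the food topic discussed above
THRESHOLD = 0.1         # arbitrary cutoff for "prominent"

hits = []
for line in sample.splitlines():
    fields = line.split('\t')
    doc = fields[1]
    pairs = fields[2:]
    # Build {topic: proportion} from the alternating pairs
    weights = {int(t): float(p) for t, p in zip(pairs[::2], pairs[1::2])}
    if weights.get(TOPIC_OF_INTEREST, 0.0) > THRESHOLD:
        hits.append(doc)

print(hits)
```

Reading the real file instead of the inline sample, and skipping any comment line starting with “#”, is the only change needed.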
With the results dumped to the appropriate folders, I turned to NetworkedCorpus to produce an accessible way of seeing which topics were present in which issues. I followed the directions from the GitHub page and got an encoding error. After a lot of editing, re-doing the topic modeling, removing text from the original Tatler pages, reading JSON, searching Stack Overflow, and sobbing quietly at my kitchen table, I realized the problem was due to Mac OS X automatically creating a .DS_Store file in the directory containing the MALLET output. After deleting the file, it worked wonderfully, and I was left with the cross-referenced site described at the start of this post.
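To spare anyone else the kitchen-table sobbing, the .DS_Store cleanup can be automated. A minimal sketch (the directory names here are throwaway, created just for the demonstration):

```python
import os
import tempfile

def remove_ds_store(top):
    """Delete any .DS_Store files (macOS Finder metadata) under `top`."""
    removed = []
    for root, _, files in os.walk(top):
        for name in files:
            if name == '.DS_Store':
                path = os.path.join(root, name)
                os.remove(path)
                removed.append(path)
    return removed

# Demonstrate on a throwaway directory containing one .DS_Store
# and one legitimate text file
d = tempfile.mkdtemp()
open(os.path.join(d, '.DS_Store'), 'w').close()
open(os.path.join(d, 'tat1.txt'), 'w').close()

removed = remove_ds_store(d)
```

Pointing `remove_ds_store` at the MALLET output directory before running NetworkedCorpus would catch the file before it causes the encoding error.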