Help needed with a HASTAC puzzle! We all know that this site has a huge range of users and post themes. To support our community, we need to know how members are already using the site in the real world. To that end, I'm digging into some of the patterns of content on HASTAC - blogs, events, tags, comments, everything. But now I need YOUR help with how to explain these groups of words. In the chart above (and in this spreadsheet), you'll see 12 columns of word groups. These are organized in terms of "most frequently co-occurring words." In other words, they're the words that most often show up alongside each other. Here's where you come in: What would you title each of these categories?
I would like to figure out, just from the set of words appearing in each, exactly what each topic is. Understanding these topics will help us better understand hastac.org usage patterns, and see how the community has changed over time. The first topic, for example, is clearly "social media," which is a common subject of interest for many HASTAC members. At the end of this post I have a list of the topic names I have come up with, but I could really use input from those of you who have been here longer than I have. So I'd like to ask, if you find a topic which I've mislabeled, or if you have a better idea for a topic name, please share it in the comments!
Some more details on the process...
I started by looking at tags on "nodes" (essentially any blog, forum, event, or opportunity post), but have since begun exploring the actual textual content of nodes and comments on those nodes. This involved a fairly complicated "data wrangling" process, through which I took the content of the hastac.org Drupal data dump, cleaned it, and made it amenable to statistical analysis. Essentially, I had to turn a large .CSV into a corpus of "documents" using the R text mining package tm.
Because the node and comment text included HTML markup, and because HTML is not particularly amenable to being handled with regular expressions, I restricted the set of terms to include only those terms used somewhere on hastac.org as a "tag" or "topic." This had the additional helpful side-effect of eliminating stop words, and ensuring that documents were being represented in the most useful possible term-space. This reduced-dictionary corpus was then stemmed, a common practice in computational text analysis that treats words sharing a root but differing in their endings as the same term. For example, "academic," "academy," and "academia" are, in this analysis, counted as a single term, regardless of their endings.
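To make the cleaning steps concrete, here is a minimal sketch in Python of the same sequence: strip the markup, keep only terms that appear in a tag vocabulary, then stem. It is an illustration only, not the R/tm code used for this analysis. The vocabulary, the sample post, and the toy_stem function are all hypothetical, and the stemmer is a crude stand-in for a real algorithm such as Porter's.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Toy stemmer: strips a few hand-picked endings. This is an assumption made
# for illustration, NOT the stemming algorithm that tm applies.
_SUFFIXES = ("ia", "ic", "y")


def toy_stem(word):
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_stemmed_terms(markup, tag_vocabulary):
    """Strip markup, keep only tag-vocabulary terms, then stem them."""
    text = strip_html(markup).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [toy_stem(t) for t in tokens if t in tag_vocabulary]


# Hypothetical tag vocabulary and post body (not real HASTAC data).
tag_vocabulary = {"academic", "academy", "academia", "learning"}
post = "<p>The <em>academy</em> and academia discuss academic learning.</p>"
terms = to_stemmed_terms(post, tag_vocabulary)
print(terms)
```

Note that "the," "and," and "discuss" drop out simply because they are not in the vocabulary, which is the stop-word side-effect described above.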
The result of this cleaning, dictionary-reduction, and stemming was a document-term matrix, which represents each node and each comment by a vector of term counts. Think about this as a numeric word-cloud (such as this one for all tags on hastac.org), or as the matrix form of a series of 3-tuples (document, term, count), each recording the number of occurrences of one term in one document.
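For readers who prefer to see the structure, the following Python sketch builds both views of a document-term matrix from two made-up documents. The document IDs and terms are hypothetical, and this is not the tm code used for the analysis.

```python
from collections import Counter

# Hypothetical, already cleaned and stemmed documents (not HASTAC data).
documents = {
    "node/1": ["academ", "learn", "academ"],
    "comment/7": ["learn", "media"],
}

# Count how often each term occurs in each document.
counts = {doc_id: Counter(terms) for doc_id, terms in documents.items()}

# Sparse view: one (document, term, count) triple per non-zero cell.
triples = sorted(
    (doc_id, term, n) for doc_id, c in counts.items() for term, n in c.items()
)

# Dense view: one row per document, one column per vocabulary term.
vocabulary = sorted({term for c in counts.values() for term in c})
matrix = [[counts[doc_id][term] for term in vocabulary] for doc_id in documents]

print(vocabulary)
for doc_id, row in zip(documents, matrix):
    print(doc_id, row)
```

The sparse view omits zero counts, which matters in practice because most terms never appear in most documents.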
The primary reason I produced this document-term matrix is for use in a latent Dirichlet allocation (LDA) topic model (see these two brief introductions to the method [PDF]). The principal advantages of using LDA are that it is unsupervised, meaning that no human coding of topics or classification of terms or documents is required, and that it is probabilistic, meaning that each term and document is assigned a probability of falling into each of the k topics. Since the specification of the number of topics, k, is up to the researcher, I ran a parameter sweep along k, estimating a number of similar models (with the topicmodels package) and comparing their fit. This choice is as much an art as a science, but I settled on 12 topics as the best compromise between fit and parsimony.
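For those who want the formal version, the standard LDA generative model can be written as follows. The symbols are the conventional ones rather than anything specific to this analysis: alpha and beta are Dirichlet hyperparameters, phi_j is the term distribution of topic j, theta_d is the topic mixture of document d, and z and w are the topic assignment and observed term for the n-th word of document d.

```latex
\begin{align*}
\phi_j &\sim \mathrm{Dirichlet}(\beta), & j &= 1, \dots, k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}

% One common fit measure for comparing models across values of k
% (lower is better, usually computed on held-out documents):
\[
\mathrm{perplexity} = \exp\!\left( - \frac{\sum_{d=1}^{D} \log p(w_d)}{\sum_{d=1}^{D} N_d} \right)
\]
```

Perplexity is shown only as one widely used example of a fit measure; I am not claiming here that it is the only sensible criterion for choosing k.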
The end result can be seen in the image that heads this post. Many of these 12 topics are relatively easy to characterize (another feature of LDA is that substantive labelling of topics is up to the researcher). I have also produced a spreadsheet that lists the 30 best-fitting terms for each topic, which you can view here. Below, I offer my best attempt at discerning useful category names from these algorithmically identified clusters of words:
- social media - fairly straightforward
- HASTAC - again, straightforward
- university programs and groups
- digital media & learning
- digital humanities - note the stemming!
- digital media
- "general" learning - clearly different from Topic 4
- cultural studies - perhaps, instead, social issues?
- scholarly publishing
- the internet
I would very much welcome your input in identifying these topics -- as noted above, this is a subjective exercise, and something that is best done in a conversation. I am hoping to use these category descriptions to characterize the evolution of hastac.org over time, as well as the interactions between individuals in an interdisciplinary network. Thanks in advance for your help!
This material is based upon work supported by the National Science Foundation under Grant Number 1243622. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.