When reading Ted Underwood's "Where to start with text mining," I was longing for a definition of text mining at the beginning. It felt like Underwood dove into the topic without giving even a brief summary of it.
Underwood highlights that a large collection of texts is usually necessary in order for the analysis to have context. This could be seen as a bad thing because gathering a lot of data seems like a lot of work. He goes on to talk about programming skills, and whether they are necessary. He beats around the bush a bit, saying he can't really answer the question, but then gives an anecdote alluding to programming skills being necessary. I would say everyone should learn programming whether or not they have to, but I may be biased as a computer science major!
Text mining can allow you to do these things and more:
- Categorize documents
- Trace the history of particular features (words or phrases) over time
Both seem like useful things to get out of text mining, and the list is not exhaustive.
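To make the second item concrete for myself, here is a minimal sketch of tracing one word over time. The tiny corpus, the years, and the helper names (`tokenize`, `relative_frequency`, `trace`) are all made up for illustration; Underwood's post does not prescribe this method, and a real project would need far more text per year.

```python
from collections import Counter
import re

# Hypothetical toy corpus: (year, text) pairs invented for illustration.
corpus = [
    (1850, "The railway changed the town. The railway brought trade."),
    (1900, "The telephone and the railway connected every town."),
    (1950, "Television arrived, and the telephone was in every home."),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency(word, text):
    """Share of tokens in `text` that equal `word` (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)

def trace(word, dated_texts):
    """Map each year to the word's relative frequency in that year's text."""
    return {year: relative_frequency(word, text) for year, text in dated_texts}

for year, freq in trace("railway", corpus).items():
    print(year, round(freq, 3))
```

I used relative frequency rather than raw counts so that longer texts do not automatically look more "railway-heavy" than shorter ones.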
In "Topic Modeling: A Basic Introduction," Megan R. Brett starts with a definition: "Topic modeling is a form of text mining, a way of identifying patterns in a corpus." She also gives a useful way to think about topic modeling: "One way to think about how the process of topic modeling works is to imagine working through an article with a set of highlighters. As you read through the article, you use a different color for the key words of themes within the paper as you come across them. When you were done, you could copy out the words as grouped by the color you assigned them. That list of words is a topic, and each color represents a different topic."
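The highlighter analogy can be mimicked in a few lines of code. This is only a sketch of the analogy, not a topic modeling algorithm: the colors are assigned by hand in a made-up `highlights` dictionary, whereas a topic modeling tool works out the groupings from the corpus itself. The sample sentence and the function name `group_by_color` are my own inventions.

```python
from collections import defaultdict

# Hypothetical highlighting: each key word is hand-assigned a color,
# standing in for a reader marking up an article.
highlights = {
    "ship": "yellow", "harbor": "yellow", "cargo": "yellow",
    "wheat": "green", "harvest": "green",
}

article = "The ship left the harbor with cargo of wheat after the harvest"

def group_by_color(text, color_of):
    """Collect highlighted words under their color, in reading order."""
    topics = defaultdict(list)
    for word in text.lower().split():
        color = color_of.get(word)
        # Skip unhighlighted words and avoid listing a word twice.
        if color is not None and word not in topics[color]:
            topics[color].append(word)
    return dict(topics)

for color, words in group_by_color(article, highlights).items():
    print(color, words)
```

Each color's word list plays the role of one "topic" in Brett's description.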
Brett agrees with Underwood that a large corpus is usually preferable, and that you need to know how to use your tools and understand your results.
It seems like we might try out topic modeling for ourselves, and form our own opinions as we go through the process!