Clouds and Corpora: Reading Text through Keyword Frequency Tools

Blogged by: Tatiana Servin

Whenever I encounter words and phrases like keyword frequency, word analysis, or deconstruction (basically any word or phrasing that suggests counting or scrutinizing words), my brain rifles through the memory cloister harboring late-night readings and class discussions on Derrida from graduate theory courses. I am reminded of the attention paid to the reading process itself and, by extension, to the text itself (what does the text alone tell the reader, independent of what the reader may pour into it by virtue of subjectivity?). Deconstruction, as it relates to the process of reading and to the nature of a text, spurs questions such as:

  • What can a corpus of text tell the reader about the work?
  • What features matter in the analysis of text?
  • Can looking at a word count, and in particular at the most frequently used words, reveal a general idea of what a text will encompass?

The final question opens a number of sub-inquiries for me. If I were to combine a few articles and submit the resulting corpus for word count analysis, I would have only an analysis of word frequency. That is, I could conclude that certain words appeared most frequently across the sum of the texts, but I could not tell whether each author was using a given word in the same context. Let’s turn to an example.

Putting Wordle into Practice

Using Wordle, a more visual tool than a word frequency list, I combined the text of the following articles:

Looking at the Wordle cloud at the top of the post, I could deduce that, as a whole, the five articles were generally about the following words (in order of highest frequency):

  1. “digital”, “humanities”
  2. “technology”, “may”
  3. “values”, “research”, “community”, “new”, “tools”, “rather”

Without going into the nuances of the text analysis (at least not yet), I could start to conclude that these texts are commonly discussing the digital humanities, which probably encompasses technology. Furthermore, the word “may” leads me to think that the articles are discussing possibilities, potential, and, to stay with this alliteration, a panorama of prospect. The frequency of the word “rather” could mean that the authors were discussing the digital humanities in the context of clarifying its definition, or rather suggesting alternatives in “research” methods or “tools,” or something else again.
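For readers curious about what sits underneath a word cloud, the counting step can be sketched in a few lines of Python. This is only a minimal illustration, not how Wordle itself works: the two sample sentences, the tiny stopword list, and the function name word_frequencies are all my own assumptions, standing in for the real articles and for whatever filtering a given tool applies.

```python
import re
from collections import Counter

# Hypothetical stand-in texts; the real articles are not reproduced here.
documents = [
    "Digital humanities may reshape research.",
    "Technology may change the digital humanities community.",
]

# A tiny illustrative stopword list (an assumption, not any tool's actual list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lowercase word occurrences across all texts combined."""
    counts = Counter()
    for text in texts:
        # Split on anything that is not a letter or apostrophe.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


freqs = word_frequencies(documents)

# Print the five most frequent words with their counts.
for word, count in freqs.most_common(5):
    print(word, count)
```

Note that the counter treats every occurrence of “may” identically, whichever text it came from and however it was used, which is exactly the limitation on context described above.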

As I used Wordle to analyze the combination of articles by Kirschenbaum, Svenson, Bush, Spiro, Cohen and Rosenzweig, key themes began to emerge. Digital humanities was, somewhat indirectly, the who of my collection of text; technology was the what and where; and the importance of research, community, and possibilities provided the why. Generally, that is. I say generally because it takes Derrida’s deconstructor to look at iteration in terms of context, split meanings, and etymological relationships in each of the articles.

Below is a word cloud created out of the bibliographies from the same collection of articles: 

Analyzing a collection of sources, rather than a collection of articles, I let whatever initial reactions or impressions I had to the image dictate my analysis. The words “accessed,” “web,” and “link” resonated with me, and led me to search for which words were not present. The word “print” does not appear at all. The authors were all primarily using online sources. Wordle allowed me to find, almost instantaneously, the most common type of source in a lengthy list of citations. In addition, I could view other kinds of quantitative data, such as who the most commonly cited author was in the collection of articles.
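Looking for what is missing from a cloud can also be done directly on the text. The sketch below checks which of a handful of terms occur in a bibliography. The sample entry is a made-up placeholder rather than a real citation, and the function name present_terms is hypothetical.

```python
import re

# Made-up placeholder entry; not a real citation from the articles.
bibliography = "Author, A. Placeholder Title. Web. Accessed on some date. Link omitted."


def present_terms(text, terms):
    """Map each term to True if it occurs as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {term: term in words for term in terms}


result = present_terms(bibliography, ["accessed", "web", "link", "print"])

# Report each term and whether it was found.
for term, found in result.items():
    print(term, "present" if found else "absent")
```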

For quantitative data, Wordle is a toy one can use to play with word frequency analysis. Qualitatively, Wordle provides a solid starting point for the reader. It offers the reader the equivalent of a preview of a film: you have a general idea of what the film will be about, but you will need to see the film in order to confirm or overturn any initial presumptions.

