Blog Post

Text Analysis in Digital Humanities

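This blog post deals with three readings: Cohen's "From Babel to Knowledge," Ted Underwood's "Where to Start with Text Mining," and John Theibault's "Visualizations and Historical Arguments."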

All three readings share a general theme: using text to find patterns that can support, and even highlight, arguments and historical analysis, and that can give the reader a visualization of this historical research. I was particularly drawn to a statement in Cohen's article "From Babel to Knowledge." He stated, "These computational methods, which allow us to find patterns, determine relationships, categorize documents, and extract information from massive corpuses, will form the basis for new tools for research in the humanities and other disciplines in the coming decades." Before taking this class, I had no knowledge of GIS or Google's Ngram Viewer, or even of what a word cloud was. I certainly agree with my professors' statement that graduate students need these introductions to the options in digital history, not only because our world is moving fast on the technology highway, but also because using these tools to explore the multidimensional fields of rhetoric can open up interesting possibilities for future research.

Cohen mentions using an API (application programming interface) to search collections of materials for patterns, and using an index or inverted index to categorize them. I can see this application being important, even in my own research on wartime letters. Understanding how to use an API will help in creating a directory of the most commonly used words within this collection of letters, or in categorizing letters by which words are used most. This falls in line with Cohen's comment that as a collection grows, more information can be extracted that wouldn't necessarily be seen in a smaller collection. Guess I have more collecting to do.
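To make the idea of an inverted index and a word directory a little more concrete, here is a minimal sketch in Python. It is not Cohen's method or any particular API. The three sample letters are invented for illustration, and the function names (tokenize, build_inverted_index, word_counts) are my own hypothetical choices.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample letters; the text is invented for illustration only.
letters = {
    "letter_01": "Dear Mother, the mail came today and I miss home.",
    "letter_02": "The mail is slow, but I hope to be home by spring.",
    "letter_03": "We marched all day. I miss you and hope you are well.",
}

def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def build_inverted_index(docs):
    """Map each word to the sorted list of letter ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def word_counts(docs):
    """Count how often each word appears across the whole collection."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts

index = build_inverted_index(letters)
counts = word_counts(letters)

print(index["home"])          # which letters mention "home"
print(counts.most_common(5))  # the most frequent words overall
```

In a real collection the most frequent words would mostly be ones like "the" and "I," so a list of common words to ignore would probably be the next thing to add.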

Ted Underwood's article "Where to Start with Text Mining" is a great transition from this last statement of Cohen's article. What can you do with these large collections of material once you have them? Text mining, put simply, means using a programming language like Python to write your own program that searches the collections for useful information. Underwood noted that there is a misconception that we currently have a multitude of material that has already been digitized. This is misleading, as he states that "page images are not the same as clean, machine-readable text." This supports Cohen's assessment that large collections are needed to pull good research from, but the material must be text that programs can actually read. Underwood uses an example that illustrates how difficult it sometimes is for machines to read text images: the long 's' in older English printing, which machines tend to misread, so the word comes out distorted in the final product. There are projects like the Brown Women Writers Project which manually transcribe text. (This is something I've started on my letters, but I foresee it being a daunting task for one person. Dr. Mack was kind enough to send this reference my way: "One way of transcribing letters.") Personal handwritten letters would be just as difficult a task for programs as the long 's.' Perhaps in the future, advances in letter recognition will be able to handle the many styles of handwriting.
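As a small illustration of what writing your own program to search a collection might look like, here is a hedged sketch of a keyword-in-context search in Python. It assumes the text has already been transcribed into clean, machine-readable form, which is exactly the hard part Underwood describes. The sample sentence and the function name keyword_in_context are hypothetical.

```python
import re

def keyword_in_context(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words on either side."""
    words = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits

# Invented sample text standing in for a transcribed letter.
sample = "The mail came late. No mail again today, but mail may come tomorrow."

for left, word, right in keyword_in_context(sample, "mail", width=2):
    print(f"{left:>20} [{word}] {right}")
```

Seeing a word lined up with its neighbors like this is one simple way to start noticing how it is actually being used across many letters.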

I was particularly impressed by something Underwood mentioned in his article: "Words matter." He further elaborated on rhetoric by expressing that words "hang together in interesting ways - like individual dabs of paint that together start to form a picture." What he is referencing here is the idea that words take on different dimensions that can be gleaned through text mining. Certain phrases are just as important as the particular words used during certain periods of time. This allows for historical contrast: finding clusters of statements that can be compared over time, and so forth. Within his article he hyperlinks to a PDF download from the Stanford Literary Lab. This pamphlet is called "Loudness of Word." I found it interesting because it raises the idea of sound mechanics within novels - hearing the voices and diction. This supports the diachronic argument in research and, most importantly, could support my own research on letters written during wartime. Not only would there be an element of commonly used words, but the diction of these words could lead to a better understanding of the mentality of the author or authors during this period. Clustering and diction, as argued by Underwood, would group words to create a semantic map that helps the reader visualize and understand. While he notes that text mining may only be an "exploratory technique," it certainly gives the digital historian food for thought.

The last reading, John Theibault's "Visualizations and Historical Arguments," follows up nicely. Theibault emphasizes that a reader's understanding can be "enhanced by close attention to the image." This image is not necessarily a picture or painting, but an image produced by programs that process information and place it in graphs or maps. He notes that there is a challenge for the new historian: how to align rhetoric with the audience's ability to follow it. Visualization is a perfect way to overcome this, and as technology advances, historians will find many different ways to visualize research digitally.

One comment which caught my attention was his statement: "Many explanations have been offered for the relative decline of social history since its heyday in the 1970s, but a failure of imagination in the integration of visualization with text based arguments may have contributed to the decline." My question here would be: will new innovations in today's digital humanities bring about a renewed interest in social history? I feel that I have popped out of a cultural study of my letter collection and into a more social viewpoint. (I'm still on the fence about this, though.) I can see, after reading some academic texts of the 1970s and 1980s, that some of the visualizations used by the authors then did exactly what Edward Tufte warned against by creating "chartjunk." These unnecessary additions of images and cluttered information did nothing but confuse me. I would have to read the passage numerous times just to understand what the point was. Theibault's reference to Isao Hashimoto's "2053 nuclear explosions" (which I've watched) is pure genius. The viewer can really grasp what the creator of this "cinematic map" was expressing; it was simply made, but profoundly understood. I need to remember this when I begin to explore options for the final analysis of my letters.

I am certainly looking forward to diving into some of the programs referenced by these authors. It seems that as the weeks go by, I am learning more than I could ever have imagined about the use of technology for research.



1 comment

These are really useful articles to have reviewed, Connie! I'm really interested in exploring text mining in my own research, but I'm also completely new to it. Thanks for sharing!