Blog Post

Data Mining: The Perils of Text Analysis

I remember it well. It was a nice fall day. The first semester was at full throttle and everyone was busy at work. I had been working on my first textual analysis of selected writings by Henry David Thoreau and Susanna Moodie, wanting to discover patterns between the two works with regard to the Canadian landscape. The results showed the beauty of the metaphors interwoven through the texts, as well as the uncertainty of a new life in such a vast wilderness. Data mining was itself a landscape, one that I quickly endeavoured to explore.

At first, my hopes were satisfied. One of my professors, Dr. Graham, had a new project that he asked me to assist with. We would take the massive, albeit famous, volumes of former Canadian Prime Minister William Lyon Mackenzie King's diaries and perform a textual analysis. Who knew what we would discover? Maybe a hidden transcript once unknown to history. We might become famous overnight. History was in our hands.

I started immediately by going to Library and Archives Canada's online site. I would input a massive wall of text into Voyant Tools and the computer would output a beautiful array of numbers and sequences, revealing a pattern that I would be the first to realize. Here we were. The making of history was literally keystrokes away. Yet, where were the diaries? I mean, yes, there were images on the screen and I could read his diary, but that was of no help to me. Where was the plain text? I searched and searched but to no avail. My envisioned making of history was nowhere to be found. The project was quickly abandoned.

Hyperbole aside (and I hope you enjoyed this embellished story), my account contains some level of truth. Let me explain...

Dr. Graham and I really did attempt a textual analysis of the former PM's diaries but ultimately failed, though not without a lesson and hope for the future. See, text analysis is a wonderful tool on the digital historian's belt. You can read pages upon pages of text in a matter of moments. Our old ways of finding pattern, metaphor, and meaning in great literature, albeit heroic given the sheer volume of many works, are beginning to evolve in parallel with our technology. Textual analysis, a form of data mining, allows you to view words in and out of context and digest it all in a timely manner once only dreamed of. Our hope was to find new patterns in our analysis of the diaries, as textual analysis can reveal what traditional readings may have missed. Not that those traditional readings are invalid; rather, we are building upon previous methods.
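To give a concrete sense of what "viewing words in and out of context" means computationally, here is a minimal sketch in Python of the two operations at the heart of tools like Voyant: a word-frequency count and a keyword-in-context (concordance) listing. This is a hypothetical illustration of the technique, not Voyant's actual code; the sample sentence and stopword list are made up for the example.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=frozenset()):
    """Count word occurrences, ignoring case and an optional stopword list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

def concordance(text, keyword, window=2):
    """Return each occurrence of `keyword` with `window` words of context on either side."""
    words = re.findall(r"[a-z']+", text.lower())
    return [
        " ".join(words[max(0, i - window):i + window + 1])
        for i, w in enumerate(words)
        if w == keyword
    ]

# A made-up snippet standing in for a page of diary text.
diary = "The wilderness was vast. We crossed the vast lake at dawn."

print(word_frequencies(diary, stopwords={"the", "we", "at"}).most_common(2))
# → [('vast', 2), ('wilderness', 1)]
print(concordance(diary, "vast"))
# → ['wilderness was vast we crossed', 'crossed the vast lake at']
```

At scale, the same two primitives, run over thousands of pages at once, are what let a researcher spot a recurring metaphor in moments rather than months.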

We did attempt to use free Optical Character Recognition (OCR) software to turn the images of the diaries into plain text, but this ultimately failed. The technology (at least the free offerings) is not perfect yet. The program spat out a large amount of gibberish, and the alternative (typing out the entire diary) was unthinkable; no one (unless it was one's job) would have time for this. We had to abandon the project.

But this does not in the least worry me, for our attempt must wait for a later epoch, though one that I can see on the horizon. This failed attempt is one of many case studies of accessibility in the digital humanities. My hope, of course, is in progress. Technology improves at such incredible rates that projects such as ours, once dwarfed by the technology, soon become possible. We can hardly comprehend the speed at which technology improves. (Take, for example, Moore's Law, which observes that the number of transistors on a chip, and with it processing power, roughly doubles every two years; some even argue that the rate of increase is itself accelerating.) Thus our project must wait a few years. I suspect it will become possible in the near future, when free OCR technology can transcribe the diaries accurately (or at least until someone transcribes the documents by hand).

Who knows? We may one day be making history with HAL 9000. Let's just hope the practice is better than the theory for this one.



Rob, this is an excellent post, and as John Unsworth, Jerome McGann, and others have pointed out, we can learn a lot from our failures, and this is especially true in cases of text mining and analysis. One thing I'd like to point out is that, depending on the size of your corpus, it's actually OK to work with data that contains a certain amount of noise. I understand the temptation to strive for perfect data—I'm currently working with a very small corpus, so am spending a lot of time manually correcting OCR—but I don't think that should hold you back. Of course, this brings up the question of how much noise is permissible given the size of your corpus, and those questions are better left to others. Also, I don't know what percentage of your OCR output was gibberish. If you're comfortable with the Python programming language, you may want to check out the new Uses of Scale project from Ted Underwood et al.; part of what this project is doing is collating massive lists of common OCR errors that can be corrected via Python code.
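The approach described above—fixing known OCR misreadings in bulk—can be sketched in a few lines of Python. To be clear, this is not the Uses of Scale project's actual code, and the tiny error dictionary below is invented for illustration; real projects collate thousands of such pairs from corpus-wide error analysis.

```python
import re

# A hypothetical (tiny) dictionary of common OCR misreadings.
OCR_FIXES = {
    "tbe": "the",
    "Govermnent": "Government",
    "rnet": "met",
}

def clean_ocr(text, fixes=OCR_FIXES):
    """Replace known OCR misreadings, matching whole words only
    so that substrings inside correct words are left alone."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, fixes)) + r")\b")
    return pattern.sub(lambda m: fixes[m.group(1)], text)

print(clean_ocr("tbe Govermnent rnet in Ottawa"))
# → the Government met in Ottawa
```

Note the word-boundary anchors (`\b`): they keep a fix like `tbe → the` from mangling words that merely contain the error string. In practice, the hard part is building and vetting the substitution list, not applying it.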

Anyway, just thought I should add this note, and I should also add that I am a fan of ABBYY FineReader, which does an amazing job with OCR (though the quality is always going to depend on the quality of the scans you are working with) and allows you to easily check and edit any errors it perceives. It is definitely not free software, but it may be worth considering, or checking with your library to see if they have a subscription. At Northwestern, we have a machine with ABBYY installed for faculty and graduate students to use free of charge, and I'm guessing many libraries with digital collections might too. Good luck!



I really appreciate your insight as well as all of the links. I will definitely check them out.

I had never really given any thought to a certain amount of noise being permissible. As you mention, though, with such a large amount of text—most of which came out as gibberish—it is hard to know where to draw that line, but that is a great insight. I had never heard of ABBYY FineReader before, but it does seem very useful as well.

Many thanks,