Crowdsourcing Literary Textual Analysis Projects

Hello everyone! This is my first official HASTAC blog post. I’m a PhD student in English at the University of Iowa. As of fall 2011, I’m working toward my comprehensive exams, doing long 19th-century British literature. I’m interested in nationhood and national identity, imperialism, Gothic literature, nautical literature and culture, decadent poetry, art history and theory, the history of science and medicine… too many things. (Almost anything, really.) At this point, I think my dissertation will focus chiefly on British anxieties of nationality and nationhood post-French Revolution.

As far as my digital experience goes, I don’t have much depth, but I have a certain amount of breadth; thankfully, “digital humanities” seems to include almost anything. From 2008 to 2010, I blogged about green technology for PCMag’s (My favorite experience there was interviewing David de Rothschild about sailing a ship made of plastic bottles across the Pacific, a nice mix of the Victorian, nautical, and technological!) Before I came to Iowa, I worked as associate editor at Law Technology News magazine, where we covered new legal technologies and software, including strides forward in e-discovery technologies: the same word search and analysis technologies that I was surprised to re-encounter when I came back to academia.

This summer, I attended the Digital Humanities Summer Institute at the beautiful University of Victoria, where I took a course in “Out-of-the-Box Text Analysis for the Digital Humanities” with NYU’s David Hoover. At DHSI, I also attended a series of interesting presentations about the sorts of literary/digital work scholars are doing in building databases, text markup, digital preservation of archives, GIS mapping…

So… what to think, then, when I came out of DHSI with a series of doubts about much of the literary-digital work that was being done?

Not all of it, of course: digital preservation and many other things seemed very useful. But when I saw a presentation about text markup, or a project that visualized social and literary connections, what I saw was highly trained PhDs spending hundreds or thousands of hours (often of their "spare time") on what amounted to data entry, for projects that were incompatible with each other (a project visualizing one writer’s social world through her letters and diaries, for example, might be incompatible with one visualizing her husband’s) or specific to a single goal (text markup projects were often not public, or were marked up in what is inevitably a subjective or goal-oriented process).

Maybe this comes from looking at it from a legal technology industry angle. But my impression is that massive e-discovery projects, in which lawyers have to sort out relevant documents from what might be hundreds of thousands of files (corporate e-mails, files, images, etc.), rely heavily on software that is specialized to de-duplicate, find relevant terms, sort, organize, and process. When files have to be sorted manually, the sorting might easily involve teams of a dozen or more paralegals working around the clock for weeks or months on end.

Of course, directly comparing the resources of academics to those of corporate law firms is unfair. There are hundreds of programmers and dozens of programming companies working to create this sort of software for law firms and corporate legal departments, and it’s a multibillion-dollar industry. This sort of money and software is unavailable to academics wanting a visual look at, say, Civil War correspondence. (Given the cultural value in preserving this sort of correspondence, it’s easy to insert a “shamefully” before “unavailable” without feeling too guilty of bias.)

But we could nonetheless learn from it. If we look at “Victorian literature” as a single “case,” or even at single authors as “cases” in themselves, then it might be time to start envisioning larger digital/literary textual projects with more people involved from the start. The practical difficulties of collaboration over distance have been all but eliminated in a world where we use email to communicate with colleagues in the next office.

Many projects have been doing this successfully for years: The Victorian Web comes to mind, as does the Networked Infrastructure for Nineteenth-Century Electronic Scholarship (NINES). Both are large-scale historical projects to which many scholars have contributed over the years. And more public crowdsourcing has been done successfully in all sorts of literary digital projects; the University of Iowa is having great success with its Civil War Diaries Transcription Project.

But perhaps we could now apply the same type of large-scale, open-to-the-public thinking to more specific textual analysis projects as well, such as text markup and social network visualization. Many of us can’t assemble teams of dozens of paralegals (or rather, graduate students) to work a case. Or can we? Voltaire wrote 18,000 letters; a small group of scholars could take years even to sort them. Why can’t we have scholars, graduate students, undergraduates, and interested members of the public working on such projects? It would come at some cost of control and precision, of course, but given the size of the archives we’re dealing with, it might well be worth it. And heaven knows that digital-literary projects require lots of planning and programming at the start, but building in the capacity for public participation might pay dividends in the long run, both in getting work done more quickly and in engaging different communities in scholarly work.

I'd love to hear tips on where this sort of work is being done, or what people have encountered when trying it. Any leads, HASTAC community?



Great point bringing up this issue of needing a clear understanding of available resources and anticipated scope of work (SOW). There are some efforts happening right at The University of Iowa to create learning opportunities for undergrads by working with archives and sources to add their contributions to digital projects. (Their contributions = metadata, original content, etc.) That's not quite like going public, but it is turning crowdsourcing into learning outcomes. And from what I've heard, the students were pretty enthusiastic about getting to contribute. 


I'd love to hear more about getting undergrads involved! Some of my friends have emphasized community-outreach / original research in their classes, and it sounds like it can be both productive and rewarding for the undergrads, who can see themselves as part of a larger scholarly community (as well they should!)


Have you looked at the Dickinson Editing Collective as a model?


I'm not sure how active they are, but they brought out a digital publication based on her letters and poems in 2008.


That looks like a fabulous resource, thanks! It looks a lot like something I was trying to envision: something that gets the public involved.

But they do ask that potential contributors write in and request access to different parts of the site. I would love to see something that combined a site like this one, which collects and makes available the scholarly results of transcription and digital collections, with totally open-access contribution like this sort of thing:

I think it would be such a productive combination... assuming you could get the public interested. But if the possibility were there, I think you'd probably get enough participation to justify the effort.