Eighteenth Century Collections Online (ECCO), one of the largest academic research collections available online, includes more than 180,000 texts published between 1700 and 1799. Imagine accessing a PDF page image of a 250-year-old text: with a few clicks, you can see a page from Johnson’s dictionary or the title page of Richardson’s Clarissa, compare eleven different editions of your favorite poetry collection, or do a keyword search on an 800-page novel. ECCO has radically changed the way many eighteenth-century literature specialists both research and teach.
What if there were an open-access version of ECCO? This is the question asked by Professor Laura Mandell, an eighteenth-century British literature scholar at Miami University in Ohio. Laura answered this question with 18thConnect (http://www.18thConnect.org).
Partnering with Gale, a division of Cengage Learning that owns ECCO, 18thConnect has announced that it will offer plain-text versions of Gale’s collection, increasing both access and searchability. Using an NEH-funded supercomputer to generate cleaner text, 18thConnect will offer improved transcription of various curiosities of eighteenth-century typography, including the infamous long “s” that is suspiciously similar to an “f.” In addition, a grant awarded to Miami University in Ohio by the Mellon Foundation will allow 18thConnect to design a crowd-sourced correction tool, which will be available to registered users in January.
I had the opportunity to talk with Laura about 18thConnect’s release, and you can learn more about the project and its significance for the future of archival research below.
What are the biggest challenges of doing historical research in a digital age?
I think that the biggest challenge has to do with searching, in particular with understanding the implications of finding and not finding things. So for instance, Renaissance scholars will often summarize their research by saying, "there are no pamphlets that represent X." As an 18th-century scholar, I was always amazed by this capacity to know exhaustively: how could you be sure that there wasn't something you hadn't found? I worry, though, that this same kind of statement will be made when scholars search EEBO, for instance. The About EEBO page says, "this incomparable collection now contains about 100,000 of over 125,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640)."
What it doesn't tell you, however, is that any time you search EEBO, you are only searching 25,000 of those texts because page images cannot be searched. I say "only 25,000," and yet that is an incredible achievement: the Text Creation Partnership at the University of Michigan typed and checked all those texts at enormous cost. But the database itself as accessed through any library catalog doesn't say ANYWHERE "You are only searching 25,000 of these 100,000 texts." Will scholars now start to say, "the only thing to be found in the archive" about the results they find in EEBO?
The ECCO text collection is better because ALL 182,000 documents were mechanically typed -- so at least you are searching some portion of the whole set. But ALL? No. Right now, the problems with mechanically typed text mean that you cannot draw significant conclusions from finding or not finding things in ECCO. Do people know that? I don't think they do.
How is 18thConnect different from existing digital archive databases?
18thConnect isn't itself really a database: what it does is collect the materials from individual databases into one large set: it "aggregates" databases, to use the current lingo. One value of that is that it allows one-stop shopping: you can search ECCO and the ESTC at the same time, for instance, and have all their results returned in one list. But it goes beyond that: we are scholar-driven and organized, so we impose shared standards on database materials, peer-reviewing materials and insisting upon a certain kind of data integrity. So for instance, we have to grapple with the problem of improving ECCO's mechanically-typed texts before we can ingest them for full-text searching. This brings scholars to the table, so to speak, in conversations about preserving the archive for the future, which is crucial because some of the problems can be solved by things we know and know how to do, by expert historical knowledge that goes beyond the knowledge of librarians and computer scientists. No one of us has all the knowledge we need, of course, but as a community, we can make serious contributions: 18thConnect will serve as a sort of go-between, making sure that scholarly expertise has the impact that it should have on shaping the future archive. It will all be digitized. And what that means is, even if backed up by texts like gold in vaults, people -- scholars of the future -- will use digital surrogates. We have to stand up now to ensure the integrity, interoperability, and findability of those surrogates, and to contextualize their meanings.
Who will benefit most from 18thConnect?
I think it will be literary and historical scholars, but one can never be sure about these things. We're modeling our work of correcting texts on the Australian Newspaper Digitisation Project: users have come to their site and have corrected 30 million articles over three years -- it's amazing. But it turns out not to be historians who are the primary users: it is the genealogists. The materials we are helping to correct with our new crowd-sourced correction tool, the materials in ECCO, are varied -- I could see lawyers being big users, for instance.
What are the next steps for your project?
After we finish re-OCRing (re-mechanically-typing) the Gale materials, building our crowd-sourced correction tool, and getting people on board to help us improve the scholarly archive for the future, we'll start working with data-mining: teaching people how to do it, what kinds of things it can show, and what kinds of things it cannot show.
(My dream -- admittedly a literary one -- is 18thConnect on a screen in front of you, colleagues, and graduate students: you set up data-mining procedures to find all texts with the features of the genre "epic," and suddenly you notice that twenty texts have all the features. These must be the most formulaic epics of all, so you touch the dot on the graph marking the twenty -- it expands out to a little solar system of text titles -- one of them is a Romance by Mme. de Scudery, translated into English; you touch that title and the whole text appears on the screen for your perusal, its generic features marked. Are they parodied? you click on "find this feature in selected texts," and paragraphs from each of the original twenty spring up around your central Romance text. You read through and analyze them together, compare passages where something generic is being declared or mocked.)
How might this model extend to other time periods, or even other disciplines?
NINES was the first online finding aid / scholarly community / digital aggregator on the scene, and that model has been extended to 18thConnect. What is so interesting to me is that, even though the data for NINES and 18thConnect overlap (you can search both at both sites), the central focus and distinctive feel of each group differs. There are amazing 18th-century databases, but fewer individual editions; NINES is peer-reviewed and packed with materials from so many major-author sites (Whitman, Blake, Rossetti), whereas 18thConnect gathers data that is less exclusively literary and more historical. And of course the central focus for 18thConnect, right now, at least, is trying to correct the data that makes the ECCO collection searchable. But we help each other: anything NINES builds can be used in 18thConnect, and vice versa. Also, on the horizon, though I can't say much about it at this point, are plans to export the model to other periods as well, and to make them all interoperable.
Oiy, that question. O.k., here goes: for literature professors especially, given that our discipline arose with mass printing, the terms we use and questions we ask are deeply tied to the printed codex form. A very straightforward instance of this fact: there are certain questions that we just don't ask because we know that they cannot be answered. What if you could ask, "in all the novels published from 1660 to 1920, is this phrase ever used?" No one asks such things now -- it would be ludicrous. So ultimately, I would imagine, research will become conditioned by algorithmic and statistical ways of moving through written materials. It won't be better, it will be different, and some things that can be done with the printed codex are unsurpassable. We'll know new things, but we'll also come back to where we started, and know the place for the first time.
About Laura Mandell
Laura Mandell has published Misogynous Economies: The Business of Literature in Eighteenth-Century Britain (1999), a Longman Cultural Edition of The Castle of Otranto and Man of Feeling, and numerous articles primarily about eighteenth-century women writers. Her recent article in New Literary History describes how digital work can be used to conduct research into conceptions informing the writing and printing of eighteenth-century poetry. That article forms part of a book manuscript in progress: “Carved in Breath: Technology and Affect in Gothic Fiction and Romantic Poetry.” She is Editor of the Poetess Archive, an online scholarly edition and database of women poets, 1750-1900 (http://unixgen.muohio.edu/~poetess); Associate Director of NINES (http://www.nines.org); and is currently participating in the development of 18thConnect, a similar online network for eighteenth-century scholars. Her current research involves developing new methods for visualizing poetry (http://miamichat.wordpress.com), developing software that will allow all scholars to deep-code documents for datamining, and improving OCR software for early modern and 18th-c. texts via high performance and cluster computing.
About Cengage Learning
Cengage Learning is a leading provider of innovative teaching, learning and research solutions for the academic, professional and library markets worldwide. Gale, part of Cengage Learning, serves the world's information and education needs through its vast and dynamic content pools, which are used by students and consumers in their libraries, schools and on the Internet. It is best known for the accuracy, breadth and convenience of its data, addressing all types of information needs – from homework help to health questions to business profiles – in a variety of formats. For more information, visit www.cengage.com or www.gale.cengage.com.
About 18thConnect
18thConnect is a community of scholars dedicated to peer-reviewing digital scholarship and gathering together the best electronic resources available in the field of eighteenth-century studies. The 18thConnect.org web site is thus an online finding aid, a first stop for scholars searching for information in the field, providing on the My18 page mechanisms for saving searches, gathering texts to correct, tagging, note-taking, and composing. It is a sister organization to NINES. Please see www.18thConnect.org and www.nines.org.