JSTOR has been in the news recently thanks to the Aaron Swartz hacking case. While I do not wish to discuss Swartz’s actions, his possible motivations, or open access questions here, I do find it notable that JSTOR, as the organization mentions in its statement on the matter, provides an API for access to broad swaths of its data. This means that coders can query the JSTOR databases and get back many kinds of results, such as n-grams, timelines, graphs, full text, and citation information. Moreover, it makes me wonder what purpose Swartz had in mind, one that JSTOR’s Data for Research (DfR) API could not have served, when he decided to download (via Python scripts and in the face of multiple attempts to stop him) approximately 4 million articles.
Indeed, it is a shame that few in the ensuing discussions have commented on JSTOR’s DfR interface. It is an incredibly useful tool for analyzing the massive amounts of scholarly knowledge JSTOR has stored. There are limitations to what DfR provides, though. For example, while we can download text and citations, there is almost nothing in the way of markup structuring that data. The response to a citations request distinguishes each reference, when possible, but is otherwise a primarily text-based response more suitable for human than machine consumption. Despite having an API for data access, then, much of the data returned is still derived from a model in which humans are the primary consumers.
Why is the format important? JSTOR and other scholarly repositories like it possess massive amounts of data, far too much for humans to process meaningfully. While not technically “big data”, there’s certainly enough to require computer-assisted analysis. But machines have a hard time dealing with text that lacks any clues about content or structure. Although JSTOR provides n-grams and various other charts, when dealing with full text or, more importantly, with citations, there’s little more than raw text.
Citations, in particular, are difficult to parse. There is a project in the computer sciences called CiteSeerX that is devoted to extracting citations from raw text, parsing them into machine-readable information, and then cross-referencing articles. As a (primarily) CS project, its focus is on automatic recognition using some rather advanced text analysis techniques. Even with the limited subject domain and, hence, fairly consistent citation style, the accuracy is in the mid to upper 90 percent range, which is very good, but not perfect. Indeed, unless journals unanimously move to structured citation data (which I absolutely think they should), citation extraction will likely never reach 100% accuracy.
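To give a feel for why this is hard, here is a deliberately naive sketch in Python. It is not how CiteSeerX works; it is a single regular expression that assumes one made-up citation layout ("Last, First. Title. Venue, Year."). The pattern, the function name, and the sample citations are all my own illustrative assumptions.

```python
import re

# Hypothetical pattern covering ONE citation layout only:
#   "Last, First. Title. Venue, Year."
# Real reference lists mix many layouts, which is why rules like this break.
PATTERN = re.compile(
    r"^(?P<author>[^.]+?, [^.]+?)\.\s+"   # "Doe, Jane."
    r"(?P<title>.+?)\.\s+"                # "An Example Title."
    r"(?P<venue>.+?),\s+"                 # "Journal of Examples,"
    r"(?P<year>\d{4})\.$"                 # "2001."
)


def parse_citation(text):
    """Return a dict of fields, or None if the text does not fit the pattern."""
    match = PATTERN.match(text.strip())
    return match.groupdict() if match else None


# A made-up citation in the layout the pattern expects.
print(parse_citation("Doe, Jane. An Example Title. Journal of Examples, 2001."))

# The same made-up work in a different layout is not recognized at all.
print(parse_citation("J. Doe (2001) An example title, Journal of Examples 12: 3-4"))
```

The first call yields structured fields; the second returns nothing, even though a human reader sees the same work. Multiply that by every house style in the humanities and the need for more sophisticated techniques, or for structured citation data at the source, becomes clear.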
The parsing of citations may seem like a fairly arcane endeavor designed to gauge article importance via citation metrics, a measurement the sciences are far more enamored of than the humanities, but the benefits could be much greater. Take, for example, how we search for scholarly material. We use a search engine like the MLA International Bibliography or something similar, within which we can search by author, title, journal, and keyword. This state of affairs probably seems natural to most of us by now. It is certainly useful, but it is also limited and limiting in ways that are readily apparent to anyone who has pursued a long research project. How many times have you found a crucial source months, even years, into research? How much easier would it be if we could search not by keywords, but by citations? That is, what if, once you found one article, you could discover not only who it has cited, but also who has cited it? In other words, you would be able to browse the citations as a network both backward and forward in time. The latter is nearly impossible with current tools. Consider, next, if this network of citations continued to branch throughout the literature and provide guidance as to which works were the most important in a given subject or on a given text, which authors were most regularly cited, and what were the most recent works entering into these conversations. It would quickly make keyword searching look like rubbing sticks together to get fire.
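For the programmers in the audience, the idea of browsing citations in both directions can be sketched as a small directed graph. This is only an illustration under my own assumptions: the class name, method names, and placeholder article labels are invented for the example and do not describe any existing tool.

```python
from collections import defaultdict


class CitationGraph:
    """Toy directed graph: an edge A -> B means "A cites B"."""

    def __init__(self):
        self._cites = defaultdict(set)      # backward in time: what a work cites
        self._cited_by = defaultdict(set)   # forward in time: who cites a work

    def add_citation(self, citing, cited):
        # Store the edge in both directions so either lookup is cheap.
        self._cites[citing].add(cited)
        self._cited_by[cited].add(citing)

    def cites(self, work):
        return sorted(self._cites.get(work, ()))

    def cited_by(self, work):
        return sorted(self._cited_by.get(work, ()))

    def most_cited(self, n=3):
        # Rank works by how many others cite them; ties break alphabetically.
        ranked = sorted(self._cited_by.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        return [(work, len(citers)) for work, citers in ranked[:n]]


# Placeholder works, not real articles.
graph = CitationGraph()
graph.add_citation("Article C (2010)", "Article B (2001)")
graph.add_citation("Article C (2010)", "Article A (1985)")
graph.add_citation("Article B (2001)", "Article A (1985)")

print(graph.cites("Article B (2001)"))     # ['Article A (1985)']
print(graph.cited_by("Article B (2001)"))  # ['Article C (2010)']
print(graph.most_cited(1))                 # [('Article A (1985)', 2)]
```

The key design choice is indexing every citation twice. Keyword databases effectively keep only the first index, which is why "who has cited this?" is so hard to answer today.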
Enter Bibliopedia, recent recipient of a Digital Humanities Start-up Grant from the NEH and an earlier grant from the University of Texas’s Liberal Arts Instructional Technology Services. I am Bibliopedia's Technical Director and, in collaboration with the project’s lead developer Jason Yandell, the co-originator of the idea. Although I wrote a description of Bibliopedia last year, that was during our early phases of research and is somewhat outdated now, though generally still accurate. So, a re-introduction is in order. In brief, Bibliopedia will take the citations from scholarly articles, parse them (using code adapted from CiteSeerX for humanities works), and then convert the data into the linked data format that powers the semantic web. From there, we will present the network of citations on a wiki-like website (powered by Drupal 7) where users can verify, edit, and extend the computer-generated information. We will start with a combination of JSTOR and the Library of Congress databases, but intend to expand to others like Project MUSE and Google Scholar. We will also, in future rounds of development, accept PDFs, Zotero libraries, and other forms of data. By aggregating, parsing, and linking all these disparate, primarily text-based sources of humanities scholarship, Bibliopedia will greatly simplify research for everyone from novices to experts.
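To illustrate what "converting to linked data" means in practice, here is a rough Python sketch that turns one already-parsed record into N-Triples statements. It is not Bibliopedia's actual code or schema. The record layout, the function names, and the `example.org` namespace are hypothetical; the Dublin Core Terms and CiTO vocabularies are real, but choosing them here is my own assumption for the sake of the example.

```python
DCTERMS = "http://purl.org/dc/terms/"   # Dublin Core Terms vocabulary
CITO = "http://purl.org/spar/cito/"     # Citation Typing Ontology
BASE = "http://example.org/work/"       # hypothetical namespace for works


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def work_to_ntriples(work):
    """Turn one parsed record (a plain dict) into a list of N-Triples lines."""
    subject = f"<{BASE}{work['id']}>"
    lines = [
        f"{subject} <{DCTERMS}title> {literal(work['title'])} .",
        f"{subject} <{DCTERMS}creator> {literal(work['author'])} .",
    ]
    # Each citation becomes a link to another work, not a string of text.
    for cited_id in work.get("cites", []):
        lines.append(f"{subject} <{CITO}cites> <{BASE}{cited_id}> .")
    return lines


# Placeholder record, not a real article.
record = {"id": "w1", "title": "An Example Title", "author": "Jane Doe", "cites": ["w0"]}
for line in work_to_ntriples(record):
    print(line)
```

The point to notice is the last statement: the cited work is an identifier that other data can point at, rather than a line of prose in a bibliography.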
Numerous other benefits will accrue to our project by virtue of choosing a linked data format. Visualizations (network maps, timelines, etc.), interoperability, data re-use, and many other abilities are hallmarks of the semantic web. Anyone who wishes to will be able to query our data and join it with other linked data efforts. Moreover, not only do queries themselves become easier and more effective, but the semantic web also enables predictive searches and targeted suggestions. Because this format allows machines to understand the relationships among the data it holds, programs can make inferences and see patterns otherwise impossible to find.
Within a year, we will have a working prototype of Bibliopedia up and running. I hope you will join us to test, comment, and improve it. I can’t wait for it to be ready. I have already wished I had it running many, many times during my own dissertation research.
If you’re interested in learning more about the semantic web, join the Semantic Web group here on HASTAC.