Towards a Future of Humanities Research -- Bibliopedia, Linked Data, and the Problem with Silos

HASTAC 2013 was so great. Fiona asked us all to post our talks, so here's mine, warts and all.
----

Bibliopedia is a tool for the discovery and analysis of humanities research, and a platform where scholarly communities can form to discuss, revise, and extend our knowledge of existing research. Conceived in part from the 2009 “Digital Textuality and Tools” HASTAC Scholars Forum and in part through long conversations among the project’s participants, Bibliopedia recently completed development of a prototype thanks to an NEH Digital Humanities Start-Up Grant. The two main project members (we had a very small team) were myself and Jason Yandell, the project's co-creator and an expert coder with 20 years of experience. Geraldine Heng, an associate professor at the University of Texas at Austin, served as the PI, provided encouragement, feedback, and advice, and graciously let Jason and me develop the project as we saw fit. So, in essence, this has been a two-person collaboration, which I think is a testament to what we can achieve in the digital humanities when we use existing and emerging standards and mature, open source software whenever possible. Before I discuss the details of Bibliopedia, however, I want to survey the existing tools for discovering humanities scholarship to explain why we need such a tool.

Imagine that during your research you find an important article published in the 1970s, and you would like to know who, in the intervening decades, has cited it. With existing tools, this task ranges from incredibly difficult to nearly impossible. You can try the Arts and Humanities Citation Index, a Thomson Reuters product, but its coverage is woefully lacking.

You can try Google Scholar, with much better results, especially now that they index JSTOR and Project MUSE, but many limitations remain. To begin, Google Scholar provides no way to analyze and visualize the relationships among research products, to discuss entries, or to import and export large amounts of data from other databases. I'm certain Google has the ability to do all these things internally, but they have yet to provide any APIs, visualizations, or discussion platforms to the regular user. Moreover, Google is, as we all know, an ad-delivery company. They provide many wonderful services for the low cost of being served ads and having our information tracked, but they are not a scholarly research platform, and they are especially not designed for the humanities. We need something designed by humanists and for humanists: something that responds to our unique needs, that is open access, and that allows for novel uses of the data by any interested parties. None of the existing services provide these things, nor are they likely to; there is money to be made in maintaining one's own silo of useful data.

Furthermore, for the humanities, citation metrics—even if we had the level of coverage that the sciences have—are often not relevant to our research. Even most of the work in the exciting area of altmetrics focuses on the sciences. What we need is a tool that helps us discover, contextualize, and discuss scholarship, not just see how many times it has been cited or how influential a given journal may be. The need to contextualize and interrogate research is far more important than the ability to generate bibliographies or find the most popular articles. Indeed, humanistic scholarship often seeks out the obscure, both for comprehensiveness and for insight. Citation metrics, impact factor, and other quantitative measures derived from the practices of the sciences can be helpful to humanities scholars, but they remain of limited value for the research process itself; they are little more than a starting point. Worse, Google Scholar and other citation indices, while useful for browsing related scholarship, skew towards the most cited works, perpetuate the errors and gaps inevitable in machine-extracted data, allow for no community discussion, and typically don't permit programmatic access to their data. How, for example, does a student new to a field use these tools to find the most influential scholarship on Sir Gawain and the Green Knight? How does this student determine the different subject areas covered by scholarship on this text? What of an understanding of the existing scholarly landscape? The results from Google Scholar provide almost no help. And so we're left with the usual methods: find a book or an article that looks relevant, read it, then begin reading the works cited. This process works fairly well, as we all know, except that it requires covering ground that many scholars have already trod. What if we could collect all that knowledge in a single place and provide some directions for the traveller?

Another problem, which Google Scholar alleviates somewhat but not fully, is the problem of data silos. As we know from discussions around Open Access publishing, a handful of large corporations control most of the access to scholarship: Elsevier, Thomson Reuters, and the like. They have a strong financial interest in limiting access to the publications they control; they charge libraries outlandish fees to search and display the scholarship that we write more or less for free. Their entire business models are based on keeping scholarship in closed silos of information that they will charge you to access. We do also have resources like JSTOR and Project MUSE, both of which are excellent non-profit aggregators of scholarship. JSTOR has even gone so far as to provide its Data for Research portal, which offers some citation metrics and text analytics capabilities as well as the ability to download bulk data. JSTOR used to offer an API, but, as I discovered one day when trying to crawl it for Bibliopedia, they have since taken it offline. Project MUSE has been talking about providing an API for several years now, but has yet to do so. Both of these services are on the way towards better interoperability, but for the moment they remain data silos of their own, though I am hopeful that this situation will change. Nevertheless, with so much of our scholarship locked away behind these digital walls, our research process suffers. We run searches on multiple databases—whatever our library happens to subscribe to—find a few interesting pieces, and start reading. What if, instead, we could see the relationships among the works in all these silos?

The semantic web is transforming the Internet from a collection of pages and data readable only by humans to one that machines can understand and process. Semantic web technology promises the ability to infer connections among different elements automatically, thereby vastly improving search capabilities, the discovery of new information, and the overall usefulness of the Internet. Just as information accessible only to humans makes up the great majority of the general Internet, so too is data about scholarly literature locked away in strings of text that computers cannot process without great difficulty. At best, search engines for repositories such as JSTOR permit researchers to query author names, journal titles, and keywords, but once a work is found, the search stops. Although Google Scholar, JSTOR, and other services attempt to show citations of articles, their usefulness is highly limited because they do not make clear the relationships among articles, present very limited metadata about each article (if any), and fail to provide for community elaboration or correction. Yet despite these limitations, such tools stand as a significant advance beyond the keyword-based search engines we're used to.

This is where Bibliopedia comes in. Bibliopedia provides an infrastructure for advanced data-mining and cross-referencing of scholarly literature to create a humanities-centered collaboration and research discovery platform. As Cathy Davidson noted in her opening keynote, when we have infrastructure in place, we can do some amazing things. Designed for modularity and extensibility, Bibliopedia can search any resource that provides an Application Programming Interface (API) for data access, thus breaking down the walls of these different silos. APIs are crucially important because they allow programmatic access to bulk data for novel uses, in essence allowing us to remix and extend the data of others. Bibliopedia's prototype phase focused on crawling JSTOR for metadata about scholarly articles that cover The Travels of Sir John Mandeville; Bibliopedia then examines the articles for citations and saves the results in a publicly accessible database. The platform also enables human-machine collaboration to discover specific mentions of locations and citations in the critical literature. Most importantly, Bibliopedia uses open source software to perform automated text analysis, citation extraction, and cross-referencing of the relationships among texts and authors. It then transforms this information into linked open data, which enables network visualizations and citation metrics while allowing for new and unforeseen uses of the data.
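
To make the crawling step concrete, here's a rough sketch in Python of the kind of loop involved. Everything in it (the endpoint URL, the query parameters, the field names) is a placeholder of my own invention, since the actual APIs vary by data source:

```python
# A sketch of a single data-source crawl: query an API for article metadata,
# then yield normalized records ready for citation extraction. The endpoint
# and field names below are hypothetical placeholders, not any real API.
import requests

API_URL = "https://example.org/api/articles"  # hypothetical endpoint

def crawl(search_term):
    """Fetch and normalize article metadata matching search_term."""
    response = requests.get(API_URL, params={"q": search_term})
    response.raise_for_status()
    for record in response.json().get("results", []):
        yield {
            "title": record.get("title"),
            "authors": record.get("authors", []),
            "year": record.get("year"),
            "fulltext_url": record.get("fulltext_url"),
        }

if __name__ == "__main__":
    for article in crawl("Travels of Sir John Mandeville"):
        print(article["title"])
```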

Bibliopedia begins from the premise that linked open data, machine-friendly text formats, and open access publishing are essential to realizing the potential of existing digital tools and technologies, and to spurring innovation and research in the humanities. The primary innovations Bibliopedia achieves are: 1) the aggregation and cross-referencing of separate silos of scholarly data; 2) the transformation of that information into a format consistent with the semantic web; and 3) the crowd-sourcing of the verification and elaboration of that data. Mapping and cross-referencing large volumes of scholarship also means that unexpected connections can be found and brought to light, along with lesser-known original works that might otherwise remain unread. Moreover, formatting scholarly references for the semantic web will make this data available to a far broader community and enable unexpected innovations. Eventually, Bibliopedia will generate custom bibliographies and visualizations based on search results, facilitating a wide variety of scholarly inquiry and discovery. Most importantly, Bibliopedia is designed for ease of use, in order to attract the largest possible range of humanities scholars as its user base, in particular scholars who may not normally use digital tools.
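
To give a sense of what that second innovation looks like in practice, here is a minimal sketch, using Python's rdflib library, of a single article record expressed as linked data. The URIs, titles, and names are placeholders I've made up for illustration; the vocabularies (Dublin Core, FOAF, and BIBO) are the ones we settled on, as I describe below:

```python
# A minimal linked-data record for one article and its author, using the
# Dublin Core, FOAF, and BIBO vocabularies. All URIs and literals here are
# hypothetical placeholders, not real Bibliopedia records.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, FOAF, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")

g = Graph()
g.bind("dc", DC)
g.bind("foaf", FOAF)
g.bind("bibo", BIBO)

article = URIRef("http://example.org/article/123")  # placeholder URI
cited = URIRef("http://example.org/article/456")    # placeholder URI
author = URIRef("http://example.org/person/1")      # placeholder URI

g.add((article, RDF.type, BIBO.Article))
g.add((article, DC.title, Literal("An Article on Mandeville")))
g.add((article, DC.creator, author))
g.add((article, BIBO.cites, cited))                 # the citation link itself
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("A. Scholar")))

print(g.serialize(format="turtle"))
```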

But, as you can imagine, we ran into some roadblocks. The most notable was the loss of the JSTOR API, which went away for vague “security reasons”. This challenge touches upon the larger issues of data silos, open access, and the need for stable, standards-based APIs, all of which are major obstacles to the future of digitally-supported humanities research. Data visualizations, digital humanities projects, the changing face of publishing, and work in altmetrics all make clear the need to revise our ideas of the critical ecosystem of scholarship. Our research products, while still overwhelmingly print-based, single-author articles and monographs, increasingly include other forms of writing, research, and presentation. Even though Bibliopedia focuses, for now, on journal articles and books, it is easily extended to incorporate other resources into its network of data. For example, I recently worked with one of the librarians at Stanford to allow him to input a catalog of video game holdings. The benefit for him was twofold: the transformation of his catalog into linked data, and the revision tracking and collaboration enabled by using a wiki format for entries.

As we gain access to more data sources, Bibliopedia will aggregate data from as many of them as possible, convert citations into a semantic web format, and cross-reference an ever-growing database of scholarly works. In doing so, it will not only overcome many of the limitations of existing tools and become a powerful research tool in its own right, but also make a valuable contribution to the growing semantic web. Providing open-access, high-quality metadata about humanities scholarship will enable others in the semantic web/linked data world to process that data in new, unexpected ways, which will accrue further benefits to the scholarly community. For example, the standards underlying the semantic web make data visualization and automated inferences about relationships trivially easy rather than the complex problems such tasks currently present. Bibliopedia will, then, through the innovation of placing metadata about scholarly literature into a linked data format, open up a vast range of possible future innovations and analyses based on data that is currently locked away and readable only by those with access to these silos.

Another virtue of a linked data format is that it will help resolve many of the challenges inherent in metadata. Rather than attempt to solve this difficult problem through automation alone, Bibliopedia will, in the process of displaying its results for human use, also provide for human feedback in the form of correction and elaboration. A common disadvantage of fully automated text analysis and data extraction tools such as Google Books and Google Scholar is that their automatic parsers introduce and perpetuate errors in their metadata that they do not allow subject matter experts to correct. Bibliopedia pursues the goal of unifying that information in an environment that not only displays the information efficiently, but actively encourages crowd-sourcing metadata on books, articles, and publications of all kinds. In thus opening data up to revision by the scholarly community, Bibliopedia can build on the strong work of mature data silos, improve overall data quality, and provide the academic community at large a continuously evolving research tool.

Now I'm going to touch briefly on the technical details. Bibliopedia consists of four main components: 1) servers to host and run all the parts of the system; 2) custom code that crawls data sources and retrieves article and book data; 3) a citation extraction engine that processes what the crawlers find; and 4) a Drupal-based web application that publishes the results. We successfully created a scalable, data-source-agnostic crawling architecture, adapted the ParsCit citation extraction software, and developed a web-based application for publishing data in a linked open data format and for tracking the changes to the data made by the scholarly community.
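
Here's a rough sketch of how the middle of that pipeline fits together: the crawler hands a plain-text article to ParsCit, and the extracted citations get posted to the web application over its REST API. The ParsCit invocation assumes its citeExtract.pl command-line interface; the endpoint URL and payload shape are my own placeholders, not Bibliopedia's actual schema:

```python
# Sketch of the extract-and-submit step: run ParsCit over a plain-text
# article, pull out the citation titles, and POST them to the web app.
# The Drupal endpoint and payload shape are hypothetical placeholders.
import subprocess
import xml.etree.ElementTree as ET
import requests

DRUPAL_ENDPOINT = "https://example.org/api/citations"  # placeholder

def extract_citations(txt_path):
    """Run ParsCit on a plain-text article; return the cited titles."""
    xml_output = subprocess.check_output(
        ["citeExtract.pl", "-m", "extract_citations", txt_path]
    )
    root = ET.fromstring(xml_output)
    return [c.findtext("title") for c in root.iter("citation")]

def submit(article_id, citations):
    """Hand the extracted citations to the web application for review."""
    requests.post(
        DRUPAL_ENDPOINT,
        json={"article": article_id, "citations": citations},
    )

if __name__ == "__main__":
    submit("article-123", extract_citations("mandeville_article.txt"))
```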

The user-facing part of Bibliopedia is built on Drupal, a robust open source content management system that runs many, many sites, including HASTAC's. Thanks to Drupal 7's native support for RDFa, a lightweight semantic web data format, and a host of contributed modules that extend this functionality, we are able to create and consume linked data in a very straightforward manner. Drupal allows for the creation of mappings between its native content types and the data structures described in different linked data ontologies. Moreover, Drupal provides a simple interface for importing other ontologies as needed. We settled upon some of the most widely used vocabularies: Dublin Core, Friend of a Friend (FOAF), and the Bibliographic Ontology (BIBO). Drupal allowed us to blend these ontologies and ensure that all records for journal articles, journals, authors, and so on are available as linked data. Some of Bibliopedia's features include the ability to import data in many formats (XML, JSON, CSV, linked data), a RESTful web services API for data ingestion, SPARQL queries, and Zotero compatibility.
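
And to show why the SPARQL support matters, here is the kind of query the linked data makes possible, asking "which articles cite this one, and what are their titles?" The endpoint URL and article URI are placeholders of my own; bibo:cites and dc:title are the actual ontology terms:

```python
# A sample SPARQL query over the published linked data: find everything
# that cites a given article. The endpoint and article URI are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sparql")  # placeholder endpoint
sparql.setQuery("""
    PREFIX bibo: <http://purl.org/ontology/bibo/>
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    SELECT ?citing ?title WHERE {
        ?citing bibo:cites <http://example.org/article/123> ;
                dc:title ?title .
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["title"]["value"])
```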

But all of this work—crawling different data silos, extracting citations, converting the results to linked data—is but the first step. I envision Bibliopedia as a data-seeded scholarly community where students, faculty, and others can come together to provide the crucially important interpretations and contextualizations of these citation networks. To this end, the software also allows users to add summaries and further discussion, to add new sources that the crawler hasn't found, to correct metadata (because citation extraction is an inherently messy and error-prone process), and to adapt the software to new ends. Bibliopedia thus represents one possibility for an advanced research platform for humanities scholars.

We designed Bibliopedia, from the beginning, to be extensible, open, and standards-oriented. The code—including our scalable, asynchronous, modular, and extensible crawler—has been in stable release for some time and is available as open source software on GitHub. You're welcome to use any part of it and to share any changes you make with us. Or, and this would be my preference, you can use the existing installation I have running. Our NEH grant paid for several years of hosting, and since I already have everything installed, configured, and working, I would love for others to use this platform. So, if you're interested, contact me and I'll get you set up so that you can import your data or just play around with the system. In the meantime, I have several undergraduates at Stanford preparing a database of the scholarship produced by faculty in the Division of Literatures, Cultures, and Languages for import into the system. And I'm in discussions with the Stanford University Libraries about applying for grants to use Bibliopedia to further the library's own deep interest in linked data. Just as I want to see access to our scholarship opened up for new uses, so too do I want to see Bibliopedia and other similar tools opened up for access. If you're interested, again, contact me and together let's build the future of humanities research. Thank you.
