
Introducing Bibliopedia: Data-mining, Cross-referencing, and Collaboration

 

Last March, Angela Kinney and I co-hosted a HASTAC Scholars discussion titled “Digital Textuality and Tools.” One outcome of this vibrant forum was a description of the sorts of digital tools that we as humanities scholars would love to have, but currently do not. From this forum, other conversations on HASTAC, and ongoing talks I had been having with a software developer friend of mine grew Bibliopedia, a project to develop just these sorts of digital tools. Along with my coder friend and me, the team includes Ana Boa-Ventura, another HASTAC scholar, who is doing web design. We received a grant from The University of Texas’s Liberal Arts Instructional Technology Services (LAITS) to begin work and are now putting together our application for an NEH Digital Humanities Start-Up Grant so that we can keep going. Since that deadline is rapidly approaching, I want to introduce the wonderful HASTAC community to what we envision for Bibliopedia and to solicit feedback so that we can further improve our vision.

There are, essentially, two components to the project. The first is the automated data-mining and text analysis aspect of the software. The second is the crowdsourcing of data verification and elaboration. Bibliopedia will be an open research-enabling platform designed to unify the many disparate, closed silos of scholarly information that are available today but remain difficult and time-consuming to employ. Too often, much of a researcher’s effort is spent simply bringing together all of the available information on a particular subject. What is more, a common complaint about Google Books, Google Scholar, WorldCat, and other digital research tools is that their automatic parsers introduce metadata errors that subject matter experts are not allowed to repair.

Bibliopedia will unify those now common automated data-mining approaches of Google et al., and will also provide subject matter experts the tools necessary to correct metadata and otherwise to extend the information available from automated data-mining. Bibliopedia will pursue the goal of unifying that information into an environment that not only displays it efficiently, but actively encourages the crowdsourcing of metadata about books, articles, and other scholarly objects. By thus opening data up to revision by the scholarly community, Bibliopedia can build on the strong work of the other mature data silos, improve overall data quality, and provide the academic community at large a continuously improving research tool.

Via JSTOR, Google Scholar, and other full-text scholarly resources, Bibliopedia will provide advanced data-mining and cross-referencing of the scholarly articles and books that discuss a narrowly focused set of primary literary texts. By focusing on individual texts rather than broad swaths of scholarship, Bibliopedia will not only allow for a deeper examination of the relevant works, but also permit the creation of a collaborative community of researchers, students, and others interested in studying the primary texts. This community will further enhance the data, cross-references, and bibliographies gathered by the software itself by providing user-generated content, discussions, and evaluation. Bibliopedia will also offer advanced visualizations of the data to permit scholars to discover new connections between works and to understand more easily the contours of existing scholarship. Bibliopedia-powered portals will thus serve to aggregate available data that is often inaccessible or invisible to users, thereby providing not only a single location at which to begin research, but also a scholarly community that will collaborate to generate new knowledge. Bibliopedia will also seek to deploy as many existing open access technologies as possible in order to make rapid progress and avoid reinventing the wheel, an all-too-common pitfall of many technology projects.

Sustained, deep interaction within a collaborative user community led by subject-matter experts will join with advanced data-mining and cross-referencing software to generate new, innovative ways of viewing, discovering, and interacting with primary and secondary literature. The results of the data-mining software will provide a valuable, well-populated database of information and cross-references that uncover neglected works and previously invisible connections. In addition, a social component in the form of user-contributed wiki-style information, tags, relevance rankings, summaries, reviews, abstracts, custom bibliographies, and discussions (among the many forms such interactions will take) will extend the automatically generated information in ways that benefit researchers at all levels.

The user interface and community-enabling aspects of the site, therefore, are of major import. While the data-mining and cross-referencing components are absolutely fundamental, they alone are not enough to create a vibrant, indispensable resource. As the recent explosion of social networking platforms and collaborative projects demonstrates, the creation of an engaged community of users from diverse realms substantially improves products. From the (at times controversial) success of Wikipedia in replacing traditional encyclopedias to the seeming ubiquity of Facebook to the prevalence of crowd-sourced knowledge creation, the importance of software that enables community has never been clearer.

While we currently do not have a site you can visit for a demonstration of the software (we are still working mostly on infrastructure and design), I hope this description is concrete and clear enough to give you an idea of what we are planning. What pitfalls do you see? What issues does this sort of work raise more generally? What features do you think are crucial to such work?

 


4 comments

Wow! An ambitious project, but one that seems to carefully address many of the problems we face with the explosion of collaborative research tools. Great work!! :)

I'm having trouble visualizing how it would work and what the scope is. You make a distinction between primary and secondary texts, and at one point mention "primary literary texts" -- is literature then the focus? And how would the kind of collaborative work this tool would foster break down the distinction between primary and secondary literary texts (perhaps even forcing us to question the category of "literature" itself)? On the one hand, this tool as described above seems aimed at supporting traditional literary scholarship (close readings of individual texts); on the other hand, it seems to have the potential to nullify those very practices. Of course, that's always the problem with proposing new methods -- one is never quite sure how they'll get picked up! -- but it'd be interesting to see that issue addressed more directly.

More broadly, there's the lurking issue of creating just one more portal (a la Google Buzz) -- and how to get the buy-in. Will this be a centralized resource, or decentralized and able to be linked up to individual library catalogs, etc.? If it can be plugged into the communities I already engage, I'd adopt it right away; if not, it's one more bookmarked site to check.

Again, great work! Very exciting stuff.

 


Hi Whitney. You ask some great questions. Yes, we're focused on literary texts for now and for a couple of reasons. One is because the community aspect of this project is so crucial that without a narrow focus, I don't believe we can build a useful community very quickly. Look, for instance, at sites like LibraryThing that allow for comments on each item, but have almost none in many cases because there's just so much data. My hope is that by focusing on a smaller set of texts, we can build a more coherent community. Then, eventually, people can deploy the software for other works as they want. The emphasis on literary texts is also a way to ease people into using such tools who often still think in terms of individual books, physical objects, or whatever texts they habitually work on. We're starting, for example, with the Travels of John Mandeville, a medieval travel narrative. As for visualizing how it will work, I've actually found a closely related endeavor here: http://www.nines.org/ While this project is a great example of how a scholarly community can build around digital tools, it has a number of limitations that we want to move past. For instance, they're not analyzing full text sources in order to cross-reference anything nor do they have an obvious mechanism for crowd-sourcing the data or significantly extending the information available. I envision people providing summaries to important articles, discussions of impact, etc.

As for being another portal, we're following the aggregation/mash-up model. That is, we want to bring together as many different information sources as we can. We're currently crawling the Library of Congress catalog, are waiting for API access to JSTOR, and are working on crawling Google Scholar. We'd love to tie in to other catalogs like WorldCat, but there are licensing and access issues we still have to work through. We're using MediaWiki for presentation because it already has templates for books, articles, etc., and because the openness to editing it enables fits well with our goals. As a result, the whole thing would be, at first, a centralized location, but one that, as I mentioned, draws on a multitude of sources. Eventually, we want to decentralize it more and maybe develop some browser plugins or embeddable web-page widgets. That's well down the road, though.
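To make the aggregation/mash-up idea a bit more concrete, here is a minimal Python sketch of merging records about the same work that arrive from different sources. It is only an illustration under stated assumptions, not Bibliopedia's actual code: the record fields, the source names ("catalog_a", "catalog_b"), and the helper names record_key and merge_records are all hypothetical, and matching on a normalized title is a deliberately naive stand-in for real record matching.

```python
def record_key(record):
    """Naive matching key: lowercase title with whitespace collapsed (an assumption)."""
    return " ".join(record["title"].lower().split())


def merge_records(records):
    """Merge records that share a key.

    The first non-empty value seen for a field is kept, and every
    contributing source is remembered so an editor can trace the data.
    """
    merged = {}
    for record in records:
        entry = merged.setdefault(record_key(record), {"sources": []})
        entry["sources"].append(record["source"])
        for field, value in record.items():
            if field != "source" and value and field not in entry:
                entry[field] = value
    return list(merged.values())


# Made-up placeholder records from two hypothetical sources.
records = [
    {"source": "catalog_a", "title": "Example Title", "author": "A. Author", "year": None},
    {"source": "catalog_b", "title": "example  title", "author": "", "year": 2001},
]

for entry in merge_records(records):
    print(entry)
```

Keeping the list of sources on each merged entry is one way a wiki editor could later see where a questionable piece of metadata came from before correcting it.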

One final note. You mention this is ambitious, and it definitely is. Thankfully, we're not trying to develop everything ourselves. One of our default positions is to use software developed by others whenever possible. We don't want to reinvent any wheels. Like I said, we want to mash-up everything we can, then add the glue as needed. So, I think that allows us to be more ambitious than we could ever hope to be if we had to develop all the component tools ourselves.


Hi Mike. I recently decided to give Mendeley a try, and while I haven't really used any of the social networking or datamining tools yet, the core functionality (PDF/bibliography management) is very impressive. It's definitely not perfect (I'd prefer an open-source desktop client, for example) but I like the fact that it's cross-platform and syncs with Zotero, and it seems to have a lot of strong institutional and commercial backing. The list of founders includes people from Last.fm and Skype, so they have some experience building successful social networks.

Do you see Bibliopedia as in competition with Mendeley? Or with SEASR, or Google Scholar, or even Zotero? I can see that these other projects individually don't provide exactly the functionality you're describing here, but it seems like it might be a tough space to get into.


Travis, you're absolutely right that there's quite a bit of competition in this realm. It's something that gives us anxiety attacks at times when we discover yet another tool that seems too close to us. The most recent scare was from Collex, which is a fantastic piece of software (used to power that NINES site I linked earlier) and which has a Mellon grant and several years of development behind it. I've played around with Mendeley a bit, but haven't found that it has functionality that has yet made it become a regular part of my work. As far as I can tell, the only automatic data extraction it does is to try and discover the metadata about whatever specific PDF you import, that is, that one author, title, journal, etc. That's one place we want to differ. We want to parse out all the texts a PDF (or other full-text source) cites that we possibly can, then cross-reference all those links. This ability allows for advanced visualizations (think citation networks, clouds showing number of citations of individual articles [like SEASR's author centrality visualization], tracking "discussion threads" across journals and spans of years, etc.) that aren't, to my knowledge, currently available. 
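As a rough illustration of the cross-referencing step described above, here is a small Python sketch that takes the works each article cites and builds an index of which articles cite which works, then ranks works by citation count. It assumes the hard part (extracting cited titles from a PDF or other full-text source) has already been done; the function names, the article ids, and the sample citation lists are all hypothetical and made up for the example.

```python
from collections import defaultdict


def normalize(title):
    """Lowercase and drop punctuation so variants of one title match."""
    kept = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(kept.split())


def build_citation_index(documents):
    """Map each normalized cited title to the set of article ids citing it."""
    cited_by = defaultdict(set)
    for doc_id, cited_titles in documents.items():
        for title in cited_titles:
            cited_by[normalize(title)].add(doc_id)
    return cited_by


def most_cited(cited_by, n=3):
    """Return the n most-cited titles as (title, count), ties broken alphabetically."""
    ranked = sorted(cited_by.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(title, len(docs)) for title, docs in ranked[:n]]


# Made-up sample: article id -> titles found in its bibliography.
documents = {
    "article_a": ["The Travels of Sir John Mandeville", "The Canterbury Tales"],
    "article_b": ["the travels of sir john mandeville.", "Piers Plowman"],
    "article_c": ["The Canterbury Tales", "The Travels of Sir John Mandeville"],
}

index = build_citation_index(documents)
print(most_cited(index, n=2))
```

The same index is the raw material for the visualizations mentioned above: the counts could feed a citation cloud, and the article-to-work pairs are the edges of a citation network.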

Similarly, SEASR is obviously a great tool (even if it's a little difficult to figure out how to actually use it for literary research), but I haven't found the Zotero integration to be very useful, either. Perhaps that's my fault for not understanding how to use the tool very well. But it does seem like SEASR is designed as a general purpose tool with a lot of capability, but not specifically to do the sort of cross-referencing work we want to do. It might be adaptable to our purposes, but I can't tell yet. Since, however, we purposefully don't want to compete with people, if we can adopt it for our uses, we will. We're definitely not trying to compete with Zotero since our goals are quite different. Google Scholar, maybe, but we're not interested in actually digitizing texts. Plus, their citations are not that great (lots of metadata problems, data set limited mostly to books).

Here are the key differences I see between Bibliopedia and these other tools. First, many (most? all?) such tools focus on building broad infrastructures that can be applied to many different cases. SEASR seems to be this sort of tool. Actually taking the tools and seeing an immediate application, however, is difficult. Further, we humanities scholars are trained to think in terms of individual texts and genres. Tools that don't address that disciplinary bias don't (so far) have very good uptake. NINES (powered by Collex), on the other hand, focuses on a specific subject matter and has a group of experts involved in data curation, which makes it a better example of a successful, well-established, and most importantly useful model. While we intend Bibliopedia to work on general cases, we're aiming at literary scholars interested in a focused set of texts. Doing so will, I hope, allow us not only to explain to non-DHers how to use the tools, but also to attract a broader community interested in that specific subject. It's much easier to make the case for, say, a portal that draws together a bunch of resources about The Canterbury Tales than it is to make the case for a really broad tool (which Bibliopedia intends to be, in its essence) that can be applied to anything. The community, wiki aspect of it is incredibly important, too. Wikipedia and the proliferation of other wikis have already demonstrated the power of the concept. So, we want to join crowdsourcing to focused data-mining for cross-references in a web-based application.

And, yes, there are quite a few tools with a big head start on us doing similar things. But, so far, I have yet to find anybody doing quite the same thing or in the same way. Obviously, that and the institutional support of others is intimidating, but we see a need so we're trying to fill it. If, however, anybody knows of somebody doing what we want to do, I need to know about it. But after working on this project for a year, we've yet to find it. There's a problem of little publicity for too many great tools, though. We've certainly learned a lot from looking at what other people are already doing. Lots of different people are trying to solve related problems in this field, to be sure.
