A Primer on BigDIVA and the Future of Search?

Just a week ago I had the pleasure of attending the launch of the Big Data Infrastructure Visualization Application (BigDIVA) at NC State's Hunt Library, and I thought it might be interesting to do a short write-up for HASTAC.

You can try BigDIVA out yourself, though its resources will be limited by your current permissions to databases like JSTOR. If you are in the Raleigh-Durham area, you can also participate in usability studies and offer your feedback by contacting NC State's Dr. Tim Stinson.

Before going any further, I'd also like to thank Dr. Laura Mandell of Texas A&M University for her talk and discussion, and for graciously sending me her presentation for the purposes of this post.


What is BigDIVA?

BigDIVA is an interface for navigating datasets in a visual, transparent, and likely more intuitive way. Its visual component is largely composed of a web of nodes, each of which can be interacted with and expanded into further subsets until you reach individual resources, such as PDF files of articles.


As you can see in the image above, BigDIVA offers search transparency in the sense that it presents all of the search parameters that users don't select in addition to the paths they have chosen. It is in this way that BigDIVA hopes to afford users serendipity in search, a sort of holy grail in information navigation. The project describes it like this:

BigDIVA turns query-based searching into something much more like exploring items in a library's stacks: it allows users not only to view the results that they expect from a given search criteria, but also to discover items related to their search in ways that they did not expect.

This interface is meant to scale smartly between large displays (like the one in the banner image) and smaller personal displays (like laptops and tablets). It is also complemented by a timeline at the bottom that users can adjust on the fly to mark the temporal boundaries of their search. Perhaps more interestingly, users can drag its handles to watch the web light up and discover periods of rapid and sluggish production of items matching their query.
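
One way to picture what the timeline surfaces is as a count of matching items per period within the chosen window. The snippet below is only a rough sketch of that idea in Python, not BigDIVA's actual code: the records, the decade-sized buckets, and the `counts_by_decade` helper are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical (title, year) records for illustration; not real ARC data.
items = [("A", 1798), ("B", 1801), ("C", 1805), ("D", 1819), ("E", 1848)]

def counts_by_decade(items, start, end):
    """Count items per decade whose year falls within [start, end]."""
    counts = Counter()
    for _title, year in items:
        if start <= year <= end:
            # Floor the year to the start of its decade, e.g. 1805 -> 1800.
            counts[year // 10 * 10] += 1
    return dict(sorted(counts.items()))

# Dragging the handles to 1800 and 1850 would leave out the 1798 item.
print(counts_by_decade(items, 1800, 1850))  # {1800: 2, 1810: 1, 1840: 1}
```

Decades with high counts would correspond to the periods of rapid production, and gaps to the sluggish ones.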

It is worth noting here that BigDIVA also contains a textual module, which lets users browse results in a more traditional list format alongside their visual representation.


How does BigDIVA work?

BigDIVA is currently being developed collaboratively at Texas A&M University and NC State University to interact primarily with the Advanced Research Consortium (ARC) catalog. This catalog contains a number of period- and subject-specific collections like NINES, 18thConnect, and MESA. These collections are digital humanities projects, and these primary examples cover 19th-century, 18th-century, and medieval resources respectively.

All of these catalogs, and BigDIVA itself, run atop ARC's infrastructure. That infrastructure is grounded on an Apache Solr index, which ingests data and structures it as follows:

Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.
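
To make that structure a bit more concrete, here is a minimal sketch in Python of what one such document might look like before it is sent to an index. The field names (`id`, `title`, `archive`, `year`, `notes`) and the `index_by_id` helper are hypothetical placeholders of my own, not the actual ARC schema or any Solr API. The helper only mimics the primary-key analogy, in which a later document with the same ID replaces the earlier one.

```python
import json

# Hypothetical field names for illustration; not the actual ARC schema.
doc = {
    "id": "example-0001",        # unique ID field, analogous to a primary key
    "title": "An Example Title",
    "archive": "NINES",
    "year": 1850,
    "notes": "",                 # a field can be present but empty
}

def index_by_id(docs, id_field="id"):
    """Key documents on their unique ID; a later duplicate replaces an earlier one."""
    index = {}
    for d in docs:
        index[d[id_field]] = d
    return index

# Two versions of the same document: only the revised one survives.
docs = [doc, {**doc, "title": "A Revised Title"}]
index = index_by_id(docs)

# Serialize the surviving documents as a JSON array.
payload = json.dumps(list(index.values()))
```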

These fields are determined by ARC's Resource Description Framework (RDF) schema, which is (necessarily) a source of lively debate and perhaps the aspect of the project that interests me the most personally.

This index is hosted by ARC's COLLEX, "a sophisticated COLLections and EXhibits mechanism for the semantic web". COLLEX is responsible for inputting ARC RDF files into the Solr index and, though I am not fully clear on this yet, also for managing search outputs as the intermediary between the user interface and the Solr index.


What does BigDIVA do well?

For me, BigDIVA is an astounding accomplishment given its context. From humble origins, its staff has managed to secure funding and produce a large-scale collaborative endeavor that seems nearly ready for inclusion in all university libraries.

BigDIVA's visual components make the pre-existing structure of library databases easily perceptible to its users. Not only is BigDIVA itself intuitive, but I imagine it would also have the recursive effect of clarifying the structure of library collections for library users who have not yet had that particular epiphany.

Its optical character recognition (OCR) module, TypeWright, does an impressive job, especially given the manuscripts it is being asked to handle.

And lastly, it does facilitate a certain type of discovery by visualizing materials not included in search results and by providing an interface for intuitive manipulation and browsing of heterogeneous collections.


What are BigDIVA's limitations?

BigDIVA certainly has limitations, but I think it is important to mark distinctions between those internal to the system itself and those imposed from without by things like limited funds (relative to commercial enterprises like Google), operation within academic bureaucracy, and longstanding disciplinary entrenchment. That said, here are a few things worthy of attention...

  • Disciplinary Limitations on Resources: There is a readily apparent uniformity of discipline in the resources BigDIVA presents. Coming from a background in media studies, I was at once at home and a stranger in BigDIVA. This limitation could be due to the difficulty of producing homogeneous metadata across databases from different disciplines (which is difficult enough in databases housing content from the same field). It could be due to bureaucratic and/or economic issues in acquiring access to these materials. Or it could simply be a matter of time, a problem that is or will be in the works and solved in the near future. In the best possible scenario they would develop a machine-learning-based algorithm that can automate the process of metadata production for any and all materials for input into Solr.
  • Little Support for Natural Language Queries: The search terms operate more like a traditional library catalog's. I can't tell whether they are simply Boolean keyword searches based on frequency and placement of occurrence, or if they are more complicated than that. Either way, natural language search does not work well. A lot of my colleagues have problems with students expecting library resources to operate like Google Search and interpret the meaning of their sentence-length statements or questions. An effort to render search more intuitive for students expecting natural language recognition could better take this into account.
  • Speed and Stability: The site, and especially the timeline, runs very slowly on my home computer and sometimes freezes. Upon refresh, the path I've been tracing is lost. This problem might be a fiscal one in terms of server bandwidth, or it might be internal to COLLEX and/or Solr. We can't expect BigDIVA to match Google's Dremel or similar multimillion-dollar efforts, but increases in speed and stability will be necessary for this to become an everyday tool for scholarship.
  • Permissions and Access to Resources: The closed nature of the majority of academic databases, collections, and journals makes permissions a really difficult thing for BigDIVA to navigate. A large portion of its resources aren't available to general users, and for home use would conceivably need to be routed through a library proxy (which slows things down a bit). This is an inherent problem of the academic corpus though, and one that BigDIVA handles remarkably well given the problem's enormity.
  • Limitations of the Data Structure: Lastly, there are some inherent limitations to the data structure, of the kind imposed by any Relational Database Management System (RDBMS). The navigation path is always more or less arboreal because of the nested columnar structure. What this means for BigDIVA is that the path visualized is always unidirectional and/or linear, and thus it isn't a true network. There is nothing wrong with this unless we stop with BigDIVA and presume that we have liberated search from its black box. It would be very interesting to see a similar project run through Neo4j. Graph data might allow for even more radical forms of discovery, specifically by offering things like query by example tuples, which would allow you to locate items with similar graphs at other ends or scales of the graph (as visualized for the user).
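
To illustrate the distinction in that last point, here is a toy sketch in Python contrasting the two styles of navigation. It is an illustration under my own assumptions, not how BigDIVA, Solr, or Neo4j actually work: the items, attribute names, and both helper functions are hypothetical. `drill_down` narrows a result set one filter at a time, the arboreal path, while `neighbors` hops sideways from one item to any other item that shares an attribute with it, which is closer to traversing a graph.

```python
from collections import defaultdict

# Hypothetical items and attributes, for illustration only.
items = {
    "item1": {"archive": "NINES", "genre": "Poetry"},
    "item2": {"archive": "18thConnect", "genre": "Poetry"},
    "item3": {"archive": "NINES", "genre": "Fiction"},
}

def drill_down(items, path):
    """Tree-style navigation: apply filters in order, narrowing at each step."""
    result = set(items)
    for field, value in path:
        result = {name for name in result if items[name].get(field) == value}
    return result

def neighbors(items, start):
    """Graph-style navigation: items sharing any attribute value with `start`."""
    by_value = defaultdict(set)
    for name, attrs in items.items():
        for pair in attrs.items():
            by_value[pair].add(name)
    related = set()
    for pair in items[start].items():
        related |= by_value[pair]
    related.discard(start)
    return related

narrowed = drill_down(items, [("archive", "NINES"), ("genre", "Poetry")])
related = neighbors(items, "item1")
```

Here `narrowed` contains only item1, while `related` reaches item2 (via genre) and item3 (via archive) without going back up the path.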

Anyhow, these were my thoughts after attending the presentation, speaking with Dr. Mandell over lunch afterward, and reflecting for a week. I would strongly encourage you to check out and keep tabs on the project, as it is one of the most ambitious and impressive collaborative projects that I have seen come out of the humanities in my (admittedly rather new) career.


