Gale’s Digital Scholar Lab: Making DH Easier?
On 26 February 2019, I attended a trial one-to-one session, as well as a workshop, to experiment with Gale's relatively new digital humanities tool: the Digital Scholar Lab (DSL). These were run by, respectively, Thomas Piggott, the User Experience Lead for the Digital Scholar Lab, and Dr. Sarah Ketchley, one of two academics working as Gale's Digital Scholarship Specialists. The tool has been online since September and continues to be revised to make the user experience as useful and simple as possible. It allows you to create your own corpora (or content sets, in their parlance) from Gale's prodigious digital archive, clean the data, and then run common DH analyses on your corpus, without the need for any coding knowledge. These tools are: topic modelling, clustering, named entity recognition, sentiment analysis, ngrams, and parts-of-speech tagging. In other words, it makes DH easy. In this blog post, I am going to provide an overview of my initial thoughts during the trial session and workshop, and then my critique of whether it makes DH too easy.
During the one-to-one session, Thomas asked me to approach the software as if he were not there, though he would occasionally ask me questions about what I was looking for, what I hoped to see, and how intuitive I was finding the interface. Given that my general approach to new technology is to experiment with it, press lots of buttons, google tutorials, and hope for the best (which, admittedly, usually works fine), it was interesting to be forced to slow down a little and think about why I was instinctively looking for or performing certain actions. I was slightly frustrated that the search did not distinguish between peer-reviewed material and other material, as both have different utilities for my research. I was also surprised that the date range went back not to Gale's earliest available material but 1,000 years earlier, to 1019. Even though I am not a medievalist, for a moment I imagined OCR'd, plain-text files of manuscripts that, alas, were not available (the earliest text I found on Gale was from 1543). Additionally, it was not entirely clear that you had to manually select the documents you wanted into a content set you had created in advance, rather than the interface defaulting to selecting everything and then letting you choose which content set suits the material (the latter of which is apparently going to be available shortly).
For the most part, though, I did indeed find it very intuitive: the advanced search terms were comprehensive, and I found a wealth of information about the three poets I am writing my dissertation on, in places I did not particularly expect to find it, which provided interesting potential paths for my research. I could then download documents, both as PDFs (up to 300 pages) and as XML files that contain little markup beyond the text itself and are therefore easily converted into plain-text files. Annoyingly, downloading automatically titles them with Gale's Object ID, so you still have to put in the labor of either recording them in a CSV file or renaming each file according to your own needs. The ability to download is not unique to the DSL, as it is possible across Gale, but what is unique is the ability to sort documents into content sets and then download up to 100 documents at a time.
At the time, this was really exciting, as was the fact that I could quickly, and with no coding needed, perform topic modelling and clustering, amongst other analyses, on my content set. All of these tools were easy to run. (The format of the named entity recognition output was the only one I found strangely organized; Thomas helpfully showed me the most recent version on his laptop, which was much more user-friendly and will hopefully be rolled out soon.) I could then download these visualizations, which was as important as being able to download the documents: it means this information lives on my hard drive and does not rely on Gale's continued support for the interface.
While the one-to-one session and workshop were exciting, as I experimented with the lab over the following weeks I found more to criticize, and in particular I began to wonder whether the fact that it was so user-friendly and easy was actually a problem. While Gale is very good at explaining what OCR is, why there might be errors, and why this is something to be aware of, there was no support for what you needed to consider when creating your content set. The DSL provides a service that aims to make DH both easier and more accessible to those who don't have digital humanities faculty or librarians, or who have no coding experience, or both (but who, in all cases, belong to institutions that can afford enough of Gale's archives to make this useful, as well as the DSL fee on top of that). As such, it concerned me that there was very limited guidance. For example, I had encountered enough material on corpus compilation that I had a rough idea of how to go about it, but if I had not, I think I would have been somewhat at a loss as to where to start. And when I turned to the 'Clean' section, even though I have used OpenRefine and have been to several talks about the importance of cleaning/tidying data, I was completely baffled.
Additionally, the visualizations are only as good as the content, which is only as good as the OCR and its metadata. When trying to find the oldest text, I came across a book titled Liberal Judaism and Hellenism and Other Essays, by Claude G. Montefiore, listed as being published in 1198. It was very clearly not that old, and as it was within the archive Nineteenth Century Collections Online, I assumed it was from the 1800s, but the first page of the preface mentioned World War One. I mention this not to suggest that Gale is responsible for cross-checking all its metadata, or that this is an issue the DSL should solve, because I don't think that for a second, but to show that, just as with having to rename all the files I downloaded, creating corpora takes time. Additionally, your content set can only consist of what Gale has, and potentially only what your institution has paid for. Thomas and Sarah discussed wanting to make it possible for people to add content from elsewhere to their datasets, but this raises problems with copyright, and with whether Gale would then be claiming copyright over those documents by default. However, given the lack of education about how databases work, undergrads might use this tool as if Gale had access to everything; I certainly would have.
The DSL, through its lack of scaffolding and support for first-time digital humanists, makes all this labor invisible and therefore makes DH look easy. For example, I'd never tried topic modelling, and with the DSL I could topic model my content in minutes. As an experiment, I tried to use MALLET, and spent about half an hour coming up against my lack of coding experience before conceding defeat. If I wanted to, I am sure I could have used MALLET; indeed, once my trial for the DSL ends, or once I want to use content not from Gale, I will use it. But, for now, the DSL meant I didn't have to.
I want to be clear that I don't think DH should be inaccessible or difficult. But I do think it should take careful thought and work, of a different kind from the work humanists are trained in and used to doing. As such, creating a platform that simplifies DH work to this extent may not be in the best interests of the DH community long term. I think the DSL would be a great tool to incorporate into pedagogy at the undergraduate level, as an introduction to the field and methodology of DH. But while Gale is not directly claiming the DSL is the be-all and end-all of DH, by not scaffolding it for newcomers it implicitly does, creating the illusion that DH is easy, less work than conventional close reading, and ultimately denigrating the field.