I've been thinking recently about computational literacy: what do people need to know in order to use computation to do work that is seen as valuable? Or, what should humanities scholars be taught in order to do work that uses computation? In this post, I’m going to give a quick introduction to my perspective on these questions. (This perspective is developed through field work that I largely won’t discuss here.)
One of my main motivations in this research is that humanities scholars use computational tools that come from elsewhere—that are developed by computer scientists, for the most part. Johanna Drucker, for example, argues that visualization techniques “come entirely from realms outside the humanities—management, social sciences, natural sciences, business, economics, military surveillance, entertainment, gaming, and other fields in which the relativistic and comparative methods of the humanities play, at best, a small and accessory role.” Jeffrey Binder similarly calls machine learning an “alien form of reading” in part because it “emerged from a discipline with very different concerns from our own.”
In my dissertation research, I look at the use of computational techniques (or "tools" or just "things") like machine learning algorithms. I focus especially on word embedding models, the two main examples of which were developed at Google (Mikolov's word2vec) and in Stanford's computer science department (GloVe). But people in the humanities pick these techniques up and use them, and this directionality doesn't seem likely to change.
So how does this adoption happen, and what literacies are needed for it to happen well?
Skipping a lot of examples and background theory here, in my dissertation I'm playing with a perspective that suggests that what's important isn't so much which computational techniques are used but the practices that happen around them and that stitch techniques together in ways that produce work that is, despite its relationship to other disciplines, distinctly of the humanities. A lot of humanities scholars I talk with express enthusiasm for computation but also anxiety over how to use it in ways that align with the specific values of their communities. They explain that there are few established processes in the humanities to guide this kind of work. In computer science, for example, validation is a standard part of the process, but in the humanities what validation looks like is currently unclear (see, for example, Andrew Piper or Ted Underwood, among others). Instead, I see people in the humanities stitching techniques together, using them in ways that computer scientists might not and creating cascades that allow scholars to creatively open large-scale data for interpretation.
(I should note here that I focus in this research on a pretty specific group of scholars: those who are working either alone or in groups, who tend to use techniques developed by computer scientists rather than tools built within the humanities and, finally, who hold strongly to the interpretive traditions of the humanities. I'm not trying to describe the work of all humanities scholars; instead, I'm using this specific subset to provoke theory development.)
So, if what's important is how techniques are stitched together, what should people learn? The idea I’m playing with is that computational literacy might involve an understanding of basic data formats (or “shapes” or “structures”), how they’re constructed and how they’re turned into each other. Instead of focusing on how algorithms function, we should maybe focus on what their outputs look like and what we can do with them.
Ben Schmidt has proposed something similar, suggesting that humanities scholars focus on transformations rather than on algorithms.
It is good and useful for humanists to be able to push and prod at algorithmic black boxes when the underlying algorithms are inaccessible or overly complex. But when they are reduced to doing so, the first job of digital humanists should be to understand the goals and agendas of the transformations and systems that algorithms serve so that we can be creative users of new ideas, rather than users of tools the purposes of which we decline to know.
For Schmidt, transformations are “the reconfigurations that an algorithm might effect.” The example he gives is sortedness—rather than understanding the inner workings of various sorting algorithms, Schmidt suggests that humanities scholars should understand sortedness as a property and not as something directly tied to a specific algorithm that produces it.
The data formats that I’m putting forward as important to computational literacy are similar to Schmidt’s transformations in that they draw attention away from the inner workings of algorithms to their results. I’m thinking here of formats like document-term matrices, edge lists and correlation matrices. Many algorithms produce data in these formats, but working between them allows for unique processes to emerge.
Working between data formats might imply different skills than would be associated with Schmidt’s transformations. (A note: I largely agree with Schmidt—I’m trying to build from him to suggest another aspect of literacy, not disagreeing with his suggestion that transformations are important.) Understanding sortedness might be primarily a conceptual task; on the other hand, understanding data formats and the ways they’re transformed involves certain technical skills. I agree with Schmidt that it’s probably not terribly important for humanities scholars to understand the specific processes that algorithms enact. On the other hand, being able to manipulate data might be crucial.
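To make "data format" concrete, here is a minimal sketch, with invented toy texts, of constructing one of the formats named above, the document-term matrix, in plain Python. The corpus and vocabulary are illustrative assumptions, not data from any project discussed here.

```python
from collections import Counter

# Toy corpus; rows of the matrix will be documents, columns terms.
docs = [
    "the whale moved through the sea",
    "the sea was calm",
    "a calm whale",
]

# Vocabulary: every distinct term, in a stable order.
vocab = sorted({term for doc in docs for term in doc.split()})

# Document-term matrix: one row per document, one count per term.
dtm = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```

Once data are in this shape, they can be handed to clustering, correlation or visualization steps without those steps needing to know anything about how the matrix was produced.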
A few examples:
Figure 1: Default output from Mallet.
1. Mallet’s output has always bothered me. Figure 1 shows the default format for the topic composition of documents. Here, topic proportions are ordered relative to individual documents—so, topic 1 is not always in the same column. I generally want to either visualize topic model output or perform other analysis on it. In other words, I want to stitch Mallet together with other techniques. To do this, I need to reformat these data so that they look something like figure 2. Here, topics are not distributed over multiple columns, and each row is an observation (or a text). This is closer to what Hadley Wickham calls “tidy data.” I can now do all sorts of things with these data: I can visualize them, cluster them or use them to produce a correlation matrix. Following Schmidt, these are transformations that I don’t necessarily need to understand the precise workings of—I probably don’t need to understand exactly how k-means clustering works so much as I need to conceptually understand the transformation it represents. Still, in this example I do need to know which data format I want and have enough data manipulation skills to convert Mallet’s output to that format. For a long time, I performed these transformations using a slightly absurd set of Excel scripts. This wasn’t optimal, but it was fine—it let me go on to do the work that I needed to.
Figure 2: Reformatted Mallet output.
2. Ryan Heuser has a series of posts on word embedding models, in which he sets out how vectors (series of numbers) are used to represent words and documents (as in a document-term matrix). But he also implicitly demonstrates other things you can do with data in these matrix formats, and this is where a lot of the transition to interpretation happens in his work. For example, he produces, from his word matrix, a correlation matrix, which indicates the similarity between each pair of words in his model. From this correlation matrix, he further produces network visualizations that represent words as nodes, with edges between nodes that, in his correlation matrix, are shown to be sufficiently similar. This is an ideal example of the kind of cascade I’m talking about here: Heuser creates a unique process that is distinctly of the humanities by moving between data formats in various ways. Computational literacy can be seen here in his understanding of vectors, matrices and edge lists, as well as the programming abilities needed to move between these.
3. Ben Schmidt has another (earlier) series of posts about the use of word embedding models in the humanities. Here he again argues that it’s not that important to understand exactly how the algorithms work—word2vec and GloVe are distinct processes but both produce similar output. Largely, they’re both fine. Instead, the interesting part of Schmidt’s work (in addition to usefully explaining word embedding models, suggesting ways to use them and offering an R package that helps do these things) is his discussion of vector rejection. Essentially, Schmidt collapses a high-dimensional space such that the differences in word usage associated with gender are removed (I’m oversimplifying; the blog post gives a better picture of this process). As a result, Schmidt is able to point out some really interesting things, such as that “goddess” appears to be a female-gendered synonym for “genius.” What I want to highlight here, however, is that the unique process that Schmidt puts together is built on an understanding of vector space transformations. It’s not the machine learning algorithm that’s key here; it’s the data format and attendant possibilities.
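The reshaping in the first example can be sketched in plain Python. The rows below are invented, but they follow the doc-topics layout described above: document id, source, then alternating (topic, proportion) pairs sorted by proportion, so a given topic wanders between columns from row to row.

```python
# Invented rows imitating Mallet-style doc-topics output:
# doc id, source file, then (topic, proportion) pairs ordered by
# proportion, so topic 1 is not always in the same column.
raw_rows = [
    ["0", "doc1.txt", "2", "0.6", "0", "0.3", "1", "0.1"],
    ["1", "doc2.txt", "1", "0.5", "2", "0.4", "0", "0.1"],
]

n_topics = 3

def reshape(row):
    """Return one row per document with each topic in a fixed column."""
    props = [0.0] * n_topics
    pairs = row[2:]
    for topic, prop in zip(pairs[0::2], pairs[1::2]):
        props[int(topic)] = float(prop)
    return [row[1]] + props

# Each row is now an observation: [source, topic_0, topic_1, topic_2],
# close to the "tidy" shape shown in figure 2.
wide = [reshape(r) for r in raw_rows]
for row in wide:
    print(row)
```

From here the table can be visualized, clustered or correlated directly; the Excel scripts mentioned above performed essentially this rearrangement.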
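The cascade in the second example, word matrix to correlation matrix to edge list, can be sketched with NumPy. The words, vectors and threshold below are toy stand-ins chosen for illustration, not Heuser's data.

```python
import numpy as np

# Toy word vectors (rows: words, columns: embedding dimensions).
# In practice these would come from a word embedding model.
words = ["king", "queen", "throne", "boat"]
vectors = np.array([
    [0.90, 0.80, 0.10],
    [0.85, 0.90, 0.15],
    [0.70, 0.60, 0.20],
    [0.10, 0.20, 0.90],
])

# Correlation matrix: similarity between each pair of words.
corr = np.corrcoef(vectors)

# Edge list: keep pairs above a similarity threshold, ready for a
# network visualization (words as nodes, similarities as edges).
threshold = 0.95
edges = [
    (words[i], words[j], round(float(corr[i, j]), 3))
    for i in range(len(words))
    for j in range(i + 1, len(words))
    if corr[i, j] > threshold
]
print(edges)
```

Each step consumes one format and produces another: matrix in, matrix out, then edge list out; no step needs to know how the word vectors were trained.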
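The vector rejection in the third example can be sketched in two dimensions with invented vectors; real embedding models have hundreds of dimensions, but the arithmetic is the same. The idea is to subtract from each word vector its projection onto a direction that encodes the gendered difference, leaving only what remains once that dimension is removed. This is an oversimplified sketch, not Schmidt's actual procedure.

```python
import numpy as np

# Toy 2-D "word vectors," invented for illustration.
he = np.array([1.0, 1.0])
she = np.array([-1.0, 1.0])
gender = he - she  # a direction encoding the gendered difference

def reject(v, direction):
    """Vector rejection: subtract the projection of v onto direction."""
    return v - (v @ direction) / (direction @ direction) * direction

genius = np.array([0.8, 0.90])
goddess = np.array([-0.7, 0.85])

# After rejection, both vectors lie in the subspace orthogonal to the
# gender direction, so they can be compared with gender factored out.
print(reject(genius, gender), reject(goddess, gender))
```

In this toy setup, "genius" and "goddess" start on opposite sides of the gender direction but end up nearly identical once it is rejected, which is the shape of the comparison behind Schmidt's "goddess"/"genius" observation.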
I’ve sketched out some basic ideas here and am still in the process of putting together this conception of computational literacy. I’m interested in comments and feedback, if anyone has thoughts. I might also use this space in the future for some posts on specific data formats as a way to experiment with this perspective and think about what a curriculum that builds from these ideas might look like.