There is no such thing as the “unsupervised”: Machine Learning and Critique
James E. Dobson
NOTE: The following is a draft section of a paper manuscript that is currently in progress. I want to share this with the HASTAC community as I am eager for feedback and this is the ideal place to engage in these sorts of discussions.
Within the humanistic fields of literary and cultural studies there has been a new focus on the “surface” as opposed to the “depth” of our objects of study. We have seen this interest manifested through what appears to be the return of prior practices, including formalist reading practices and attention to the aesthetic dimensions of a text, and through new methodologies that come from the social sciences and are interested in modes of description and observation. For the most part, these methodologies have not given up on the nuanced conceptions of ideological critique that have been the mainstay of criticism for the past few decades (in fact, many of these “new” interpretations might begin with what was once repressed through prior selection criteria), but they shift our attention away from a “repressed” or otherwise hidden object by understanding textual features less as signifiers, arrows to follow to some hidden depths, than as interesting objects in their own right. These methods have the potential to break from the deeply habituated reading practices of the past, but they also risk overstating the case: in giving up on critique, they remain blind to the untheorized dimensions of the new tools that we are anxious to use.
Methods of what has recently been called “surface reading” can be found in many areas of humanistic inquiry. Sharon Marcus and Stephen Best’s well-known essay, “Surface Reading: An Introduction,” introduced a 2009 volume of the journal Representations dedicated to the topic of “How We Read Now” and examines several variants of surface reading as alternatives to depth or “symptomatic” reading. In their essay, Marcus and Best name the digital humanities and computer-assisted reading as one important and particularly hopeful methodology for the future of humanistic study. They write: “Where the heroic critic corrects the text, a nonheroic critic might aim instead to correct for her critical subjectivity, by using machines to bypass it, in the hopes that doing so will produce more accurate knowledge about texts.” Replacing the heroic critic of the symptomatic era with the heroic code, they imagine an objective world of bypassed subjectivity. Best and Marcus believe that the machine and the algorithm, lacking cultural knowledge, biases, and political commitments (in other words, without being situated), can produce more “accurate knowledge” about the world brought into being by subjective human beings. This is to say that digital or computer-aided readings are imagined as escaping the subjective constraints that draw us to certain passages or conclusions. An algorithm can be excluded from the hermeneutics of suspicion because it knows nothing of the concept of hidden depth. Thus they claim that digital readings might restore a “taboo” set of goals for humanistic study: “objectivity, validity, truth.”
Even though his methodology is not explicitly or even necessarily digital, Franco Moretti is a useful figure through whom to examine the stakes of the digital intervention being proposed by critics like Best and Marcus. Moretti has, for some time now, articulated his frustration with close reading. Unlike those critics exhausted with critique because it has been appropriated, made redundant, or even just become boring, Moretti’s frustration originates within the narrow scope of close reading that complicates his criticism of larger forces and systems. Calling the slow, close reading of an individual text a “theological exercise,” he accuses critics of giving too much attention to a small set of mostly canonical texts. He desires his proposed practice of distant reading to enable an understanding of “the system in its entirety.” In other words, it is precisely the failure of abstraction foreclosed by the specificity of close reading that motivates Moretti’s desire for a distanced position.
When he puts his distant reading theory into practice in Graphs, Maps, Trees, Moretti presents an alternative model to the digital yet still qualitative methodology imagined by Best and Marcus. What keeps Moretti’s claims “honest” is the fact that his target is not something like a hard, empirically knowable reality, but the socially constructed fiction known as the market. Thus Moretti can stake out a strong position for quantitative research within the humanities:
Quantitative research provides a type of data which is ideally independent of interpretations, I said earlier, and that is of course also its limit: it provides data, not interpretation…. Quantitative data can tell us when Britain produced one new novel per month, or week, or day, or hour for that matter, but where the significant turning points lie along the continuum—and why—is something that must be decided on a different basis.
Moretti’s turn to the scientific quantitative from the humanistic qualitative takes as its presupposition some fundamental distrust of the act of interpretation. The interpretive act of reading, in his account, is too tied up with evidence. We have what social scientists would call a selection bias always informing the practice of close reading. Moretti seeks to address this problem through the separation of evidence or “data” from interpretation. The substitution of what close reading would call textual evidence with quantitative data (the length of book titles, the number of books sold within specified categories, the number of booksellers) enables his strong claim for the quantitative approach to literature.
But of course there is no such thing as context-less data. The concept of raw data, as Lisa Gitelman and Virginia Jackson have recently argued, is a bit of a misnomer, an oxymoron as they point out in the title of their edited collection. We should doubt any attempt to claim objectivity based on the notion of bypassed subjectivity because human subjectivity lurks within all data. This is because data do not merely exist, but are abstractions imagined and generated by humans. Not only that, but some criteria always inform the selection of any quantity of data, and this selection introduces the human and subjectivity into what is supposed to be a distinctly human-less product: those raw elements that we imagine can be computed, distilled, and analyzed free of subjective intent.
If one wishes to bypass as much critical subjectivity as possible, Best and Marcus seem correct: using computers to help “read” texts might be the best way forward. Best and Marcus do not elaborate on the specific digital technologies that they believe will lead to objectivity and they do not differentiate between computer-aided and completely automated approaches. Machine learning is a technology that participates in both and thus provides an ideal test case for examining the possibility of objective readings of literature. To provide some background, we need to situate machine learning in its place as a branch of artificial intelligence. These techniques uniquely have the ability to address an incredibly large amount of data with varying degrees of input from a researcher. They have been used to transform approaches in fields as diverse as economics and cognitive neuroscience. Machine learning implies the repetition of an automated task that can reflexively integrate the results of past tasks. Ideally, with each repetition the algorithm improves the accuracy with which it performs the task and therefore is considered to be “learning.”
There are two kinds of machine learning algorithms: supervised and unsupervised. Supervised machine learning sorts data into predefined categories. What makes supervised machine-learning algorithms “supervised” is the existence of something called a training dataset. In this form of machine learning there are always two datasets: the human researcher parcels a set of texts or other objects into two buckets. The first is the training dataset. Here a “label” is attached to each object to define its membership within a category. The algorithm “trains” itself on the training dataset: it extracts features from the labeled objects and learns to associate these features with the researcher-created categories. The second bucket is what is called the “test” dataset. The texts comprising the “test” dataset should be similar to those found within the training dataset. The algorithm then uses the learned features to automatically sort the data within the test dataset into the categories defined by the researcher.
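The division of labor between training and test datasets can be made concrete with a deliberately small sketch. It assumes a toy word-count classifier written for this illustration, not any particular machine-learning package; the labels (“sea,” “land”), the sample sentences, and the function names are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace."""
    return text.lower().split()


def train(labeled_texts):
    """Build one word-count profile per researcher-defined label."""
    profiles = {}
    for text, label in labeled_texts:
        profiles.setdefault(label, Counter()).update(tokenize(text))
    return profiles


def classify(text, profiles):
    """Assign the label whose profile shares the most word occurrences with the text."""
    words = tokenize(text)
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training dataset: every label here is a decision made by the researcher.
training = [
    ("the ship sailed across the stormy sea", "sea"),
    ("waves broke over the deck of the ship", "sea"),
    ("the train crossed the wide prairie", "land"),
    ("horses grazed on the dry prairie grass", "land"),
]
profiles = train(training)

# Hypothetical test dataset: unlabeled texts similar to the training texts.
test_texts = ["a ship lost at sea", "grass covered the prairie"]
predictions = [classify(text, profiles) for text in test_texts]
print(predictions)
```

However the sorting is carried out, the only categories the algorithm can return are the ones the researcher supplied in the training labels.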
If the use of the term “supervised” by computer and information scientists suggests the presence of what Best and Marcus would call “critical subjectivity,” then it must be understood as the intrusion of the human subject. Supervision means that our interpretation of the results, the output from the algorithm, must take into account decisions made by the researcher to establish a set of initial conditions. These conditions might be the existence of labels that, while not providing explicit rules or criteria for categorization, mark each text as unambiguously a member of a particular category. Thus the results of any supervised algorithm contain traces of decisions made by the researcher, precisely the “subjectivity” this work might be imagined to lack. Unsupervised algorithms would presumably lack any such influencing traces of the researcher. Yet this unsupervised state cannot be said to exist. The researcher must, as Gitelman and Jackson remind us, necessarily make a set of decisions in forming the original input dataset, even if it is completely unlabeled and considered disorganized. We must also choose an algorithm and an implementation of this algorithm. Not only will different machine-learning algorithms give different results, but differing implementations of the same algorithm may not agree. Reproducible results will depend upon the precise replication of the software and hardware environment used. Reproducibility remains an ideal, but in practice is very difficult to achieve, even more so when we are searching for small yet statistically significant bits of evidence for our claims.
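One way to see that even an unlabeled procedure carries decisions is to run the same clustering routine twice from different starting points. The sketch below assumes k-means clustering as the example, which is my choice of illustration rather than an algorithm named above, and it uses four invented points; the starting centroids stand in for the kind of choice a researcher or an implementation makes silently.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, recompute means, repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return sorted(sorted(cl) for cl in clusters)


# Four invented points at the corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Same data, same algorithm, two different starting configurations.
left_right = kmeans(points, [(0, 0), (4, 0)])
top_bottom = kmeans(points, [(2, 0), (2, 1)])

print(left_right)
print(top_bottom)
```

Both runs settle into a stable grouping, but they are not the same grouping: one splits the points left from right, the other top from bottom.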
Supervised methods are frequently called computer-aided. An application using machine learning might, for example, make use of a dictionary of key terms that define topics of interest and that can be used to index documents. One such method is known as sentiment analysis. Used mostly by social scientists and those in marketing fields, sentiment analysis takes a set of terms associated with positive and negative emotions and then automatically sorts texts or fragments of texts into these categories. The sub-categories and key terms are hardly universal; these terms are the product of the specific period in which the dictionary was assembled. A notable example can be found within the psychology dictionary that forms part of the sentiment analysis dictionary distributed with one commercial package, ProvalisResearch’s QDA Miner / WordStat. WordStat’s psychology dictionary contains a set of 3,150 total terms that align concepts and phrases into groups associated with psychoanalysis. Not just any dialect of psychoanalysis, however: Colin Martindale, the author of this dictionary, organizes his terms into areas associated with the practice of Jungian psychoanalytic analysis. One idiosyncratic grouping, “Icarian Imagery,” demonstrates the model used by this dictionary. Martindale identified these terms, and also those terms making use of these root-words, with the sub-category of ascension. They have been grouped together within the larger Icarian category:
Of course psychological concepts from the field after the influence of the cognitive and brain sciences have no representation within the dictionary. Thus this dictionary would enable one to locate potential sources of evidence for reading Jungian imagery and associated categorizations of sentiment within a text but not, say the terms used by the New Psychology of the 1890s that preceded psychoanalysis as the dominant discourse or those from the present that reflect an understanding of the mind derived from empirical studies of the brain.
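The mechanics of dictionary-based reading are simple enough to sketch in a few lines. The category names and terms below are invented for illustration and are not taken from Martindale’s dictionary or from WordStat; the point is only that a word absent from the dictionary leaves no trace in the result.

```python
import re
from collections import Counter

# Hypothetical category dictionary; these terms are invented for illustration
# and are NOT taken from Martindale's dictionary.
CATEGORIES = {
    "ascent": {"rise", "climb", "soar", "fly"},
    "descent": {"fall", "sink", "drop", "plunge"},
}


def category_counts(text, categories=CATEGORIES):
    """Count how many tokens in the text belong to each dictionary category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, terms in categories.items():
            if token in terms:
                counts[name] += 1
    return dict(counts)


# Invented sample text.
sample = "Birds soar and climb; stones fall. The mind may leap."
result = category_counts(sample)
print(result)
```

In this sample “leap” might well strike a reader as an image of ascent, but because the dictionary’s compiler did not list it, it is never counted.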
I invoke this dictionary to question some of the assumptions held by those promoting versions of machine reading and also to question the possibility of formalized and automated reading. Reading, I would claim, is always situated. But I would hesitate to ask digital humanists to limit the application of their methods to historicist approaches that would take, as an example, this Jungian dictionary and apply it to literary works that appeared at the exact same time. To do so would be to give up on much of the promise of digital methods and produce only a slight improvement over existing historicist readings. At the same time, we should recognize that computational science itself is always historicizable. Even though they tend to increase, data resolutions and system capacities are subject to hardware limitations. Algorithms change and are modified. Bugs are discovered and new ones introduced. Scientists depending on complicated configurations of software known as “pipelines” or workflows are discovering this. In addition to archiving collected data, these scientists are now seeking to archive the exact versions of software used to analyze and produce the final end products of their pipelines. Not just software, but also the hardware used in data analysis can introduce variance into the results of computation. All of this is to say that just because we are using machines to read, it does not mean that they have produced the final, definitive reading.
There are, however, other widely used methods of digital reading making use of machine learning that do not depend upon either pre-labeled data or the assistance of a user-created or supplied set of key terms. The method of topic modeling would seem to relieve us of the need for specialized dictionaries like the one distributed with WordStat. Probabilistic topic modeling, or simply topic modeling, is an emergent digital reading method that is quickly becoming popular. This method comes to the humanities from the information sciences; to what extent it might still belong to the latter is an open question. Topic models are a way to organize a large and unlabeled collection of documents into computer-generated thematic categories. Rather than supplying a list of hierarchical keywords to group documents, the algorithm “discovers” shared topics based on textual features that are used to fit documents into the discovered categories. Using single words to build our list, we might receive the following output of possible topics from Henry Adams’s The Education of Henry Adams:
Topic 0: adams, henry, minister, felt, john, washington, young, hay, asked, came, saw, went, point, took, wanted, long, war, century, reason, father.
If we decide to use multiple words to locate possible themes, what are called “n-grams,” we might receive the following output:
Topic 0: private secretary, knew better, lord russell, diplomatic education, young adams, lord palmerston, young man, young men, free soil, earl russell, eighteenth century, english society, fayette square, fifty years, foreign affairs, francis adams, henry adams, george washington, half dozen, harvard college.
Some of the same words are captured: “henry” and “adams” are grouped together as they most frequently appear this way. The term “washington” appears in the single-word list, most likely as the name of the city, but when we search for phrases it is returned as part of a personal name, “george washington.” We note that “war” no longer appears in the first topic grouping, although “civil war” appears in the second list of possible topics (not displayed).
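The difference between counting single words and counting phrases can be shown with a minimal sketch. It covers only the counting step on which such lists are built, not topic modeling itself, and the sample sentence is invented rather than quoted from Adams.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive word tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented sample sentences, not a quotation from Adams.
sample = "Henry Adams left Washington. George Washington was admired by Henry Adams."
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)

print(unigrams["washington"])
print(bigrams["henry adams"], bigrams["george washington"])
```

As single words, the city and the president collapse into one count for “washington”; as two-word phrases, “george washington” becomes a separate unit. Because this sketch discards punctuation, it also produces phrases that run across sentence boundaries, which is one more decision made before any “reading” begins.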
But even these unsupervised implementations of machine learning algorithms are subject to some of the critiques outlined above. For example, all machine-learning implementations that deal with text, both supervised and unsupervised, need to make some conversion and initial reduction of the string of characters that comprise a text or document. And if a text has been digitized from a print edition, then one has to make a selection of the digital edition. Potentially the machine-learning package will attempt to convert your text from one encoding to another, for example from ASCII to UTF-8 or vice versa. In the process, particularly when converting to a narrower encoding such as ASCII, accent marks and other textual features may be removed or translated to equivalent marks. The workflow or set of procedures might perform what linguists refer to as stemming or lemmatization on the string of words, which is to say, the operation that trims each word down to its root form, removing plurals and tense, alongside steps that strip capitalization and punctuation. For humanists, this process produces potentially large-scale information loss. In addition, most machine learning implementations used on text include an exclusion list, or stop words. Stop words are terms that are considered to be lacking in semantic content. These words are removed before running through the algorithm because they are considered superfluous; they are “noise” that would make the task of document classification much more difficult. MALLET (“MAchine Learning for LanguagE Toolkit”), a popular and free topic-modeling package, contains a default stop word list of 524 English language words. While this set contains words like “you,” “no,” “but,” “and,” and “whatever,” it also contains terms of potential interest like “associated,” “appreciate,” “sorry,” and “unfortunately.”
In conclusion, I want to state clearly the presupposition of my argument: unlike the truly quantitative fields, within humanistic practices both digitized and digital objects cannot be said to contain noise. Everything is signal. Everything signifies; even absent figures, as deconstruction has taught us, are capable of producing signification. The concept of machine reading might lead humanists to put away important critical tools that still have work to do, but this would be a mistake. Even in what I have presented as the most objective form of machine reading, unsupervised machine learning, we find aspects that require critical attention and the appearance of subjectivity and decision where some might expect to only find objectivity, validity, and reproducibility.
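The preprocessing steps described here can be sketched with nothing more than the Python standard library. The stop list below is a very short hypothetical one, not MALLET’s actual list, the sample sentence is invented, and stemming is left out; the sketch shows only encoding reduction, lowercasing, punctuation removal, and stop-word filtering.

```python
import re
import unicodedata

# Hypothetical, very short stop list for illustration; not MALLET's actual list.
STOP_WORDS = {"the", "and", "but", "was", "a", "of", "unfortunately"}


def to_ascii(text):
    """Decompose accented characters and drop whatever ASCII cannot represent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def preprocess(text, stop_words=STOP_WORDS):
    """Reduce a text to lowercase ASCII tokens with stop words removed."""
    tokens = re.findall(r"[a-z]+", to_ascii(text).lower())
    return [token for token in tokens if token not in stop_words]


# Invented sample sentence.
sample = "Unfortunately, the café was closed; but the naïve résumé survived."
tokens = preprocess(sample)
print(tokens)
```

Of ten words, five survive. The accents are gone, so that “résumé” is now indistinguishable from the verb “resume,” and the sentence’s attitude toward its own content (“Unfortunately”) has been discarded as noise.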
 As I work within American literary studies, many of my references will be the local application of what I describe as larger movements within the humanities. On description as method, see Heather Love, “Close but not Deep: Literary Ethics and the Descriptive Turn,” New Literary History 41, no. 2 (2010): 371-391. An example of the renewed interest in literary aesthetics can be found in Christopher Looby and Cindy Weinstein, “Introduction,” American Literature’s Aesthetic Dimensions (New York: Columbia University Press, 2012). For an example of the new formalism, see Samuel Otter, “Aesthetics in All Things,” Representations 104, no. 1 (2008): 116-125.
 Stephen Best and Sharon Marcus, “Surface Reading: An Introduction,” Representations 108 (2009): 1-21.
 Franco Moretti, “Conjectures on World Literature,” New Left Review 1 (2000): 54-68.
 Franco Moretti, Graphs, Maps, Trees: Abstract Models for a Literary Theory (New York: Verso, 2005), 9.
 Lisa Gitelman and Virginia Jackson, “Introduction,” “Raw Data” is an Oxymoron. Edited by Lisa Gitelman (Cambridge: MIT Press, 2012).
 See the documentation provided at http://provalisresearch.com/products/content-analysis-software/wordstat-...
 Colin Martindale, Romantic Progression: The Psychology of Literary History (Washington, D.C.: Hemisphere, 1975). The digitized version of Martindale’s dictionary can be found at the following URL: http://provalisresearch.com/products/content-analysis-software/wordstat-...
 Key references here include Stephen Ramsay’s Reading Machines: Toward an Algorithmic Criticism (Urbana, IL: University of Illinois Press, 2011) and Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (Urbana, IL: University of Illinois Press, 2013).
 Both examples are from processing Henry Adams’s The Education of Henry Adams with the CountVectorizer() class, which builds the word and n-gram counts used in topic modeling, provided with the Python-based package scikit-learn: http://scikit-learn.org.
 One of the most popular stemming algorithms is the Porter Stemming algorithm. This is incorporated within the workflows as a preprocessing step by many packages including ProvalisResearch’s WordStat. M.F. Porter, “An Algorithm for Suffix Stripping,” Program, 14, no. 3 (1980): 130−137.