Affective Response: Assessing Machine Learning Algorithms
“If my nightmare is a culture inhabited by posthumans who regard their bodies as fashion accessories rather than the ground of being, my dream is a version of the posthuman that embraces the possibilities of information technologies without being seduced by fantasies of unlimited power and disembodied immortality, that recognizes and celebrates finitude as a condition of human being, and that understands human life is embedded in a material world of great complexity, one on which we depend for our continued survival.”
― N. Katherine Hayles, How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics
Nightmare scenario: imagine that you are sitting outside, and you feel a sharp pain in your chest, severe enough to warrant a trip to the hospital. You sit in the examination room for what seems to be an eternity, waiting for the doctor to come back with the news, either good or bad. It makes no difference at this point; you just want to know. The doctor steps in and tells you that you have a week to live, as determined by a supercomputer processing all of your relevant information. The news devastates you, but at least you know when your time is going to come. After a week of checking off experiences on your bucket list, you are still alive. Concerned, you go back to the hospital and find out that the program made a mistake, and you are going to live much longer. There is a feeling of relief, but also concern. How did the program get it wrong? If the problem of determining lifespan is too complicated for a machine to understand on its own, why torment people with this less-than-reliable information?
While this might seem like a nightmare from the future, it is not science fiction. Beth Israel Deaconess Hospital’s supercomputer does this type of computation with 96% accuracy. While the number might seem impressive, if you take the remaining 4% of the 557,812 patients the hospital admitted last year, you get a total of 22,312 patients who could be misdiagnosed. Even though the supercomputer uses data from over 250,000 patients, it cannot accurately diagnose everyone. Clearly, the increasing reliance on programs to gain insight into medicine should be more transparent, and consumers should be better informed about the development of these pieces of software.
Earlier this year, the Duke Center for Health Informatics held a seminar entitled “Human-Computer Interaction Markers: Using Technology Interaction to Monitor Cognitive Function.” The seminar considered research that investigated how monitoring keyboard interactions can indicate an early cognitive decline. While this method seemed very promising in helping to detect an often-subtle deterioration of the brain, it did not take into account the experiences of patients themselves. A patient's experience with typing, what they are typing, where they are typing, and when they are typing are all relevant contexts to take into consideration. The investigators were able to collect data from such a large group of people that they felt that these circumstances or "variables" were negligible. What was important was the fact that the model could accurately predict an early cognitive decline in the patients, even if researchers could not completely understand cognitive decline and its larger context. Privileging results over understanding is easier to justify when one is dealing with a large sample size. It becomes a matter of statistics. This type of thinking is indicative of the digital age we live in, the age of "big data," where the crowds are seen to be more accurate than any individual. Often this is presented as the "wisdom of the crowds" (Kremer, Mansour, and Perry 2014). However, this privileging of data over experience, specifically in the medical field, is bound to fail, because patient experience varies to such a great extent that it cannot be considered merely a "variable." Underlying all aspects of the Duke study was a technology that determined whether or not a patient was at risk: a machine learning algorithm. Algorithms like these are deployed in a variety of different contexts, from Google searches to Amazon suggestions to stock predictions. 
Such service-oriented uses are more reliable because they rarely step into predicting human experience or assessing affect. But what happens when algorithms do consider human experience and affect?
Machine learning looks for patterns through the use of probabilistic classifiers, which, as the name suggests, utilize probability to make a prediction (Murphy 2006). One example is determining your age from your photo by statistically evaluating its similarity to other photos within a particular age group. But something like age, which is associated with a wide variety of experiences, has more subjective aspects. If you know someone has spent much time in the sun, you might not conclude that they are as old as they look, since the sun causes wrinkles. Statistics cannot give us the total picture. Scholars like John Burrows (1987), Susan Hockey (2000), Stephen Ramsay (2003), and Johanna Drucker (2012) have long debated the role of statistical modeling in humanistic inquiry. They have come to the consensus that statistics can be helpful, but should not be an exclusive path to knowledge. Humans should work together with the technology to both “close and distantly read” (Jänicke 2015) this information, to gain a complete picture of how to calculate things like lifespan or age.
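To make the idea of a probabilistic classifier concrete, here is a minimal sketch of a naive Bayes classifier of the kind Murphy (2006) describes, written in plain Python. Everything in it is an assumption made for illustration: the six training examples, the trait names, and the two age groups are invented, and real age-estimation systems work on image features rather than hand-labelled traits. The point is only to show how such a model turns counts into a probability and how it has no way of knowing why a trait is present.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: visible traits paired with an age group.
TRAINING = [
    ({"wrinkles", "grey_hair"}, "older"),
    ({"wrinkles"}, "older"),
    ({"grey_hair", "glasses"}, "older"),
    ({"smooth_skin"}, "younger"),
    ({"smooth_skin", "glasses"}, "younger"),
    ({"smooth_skin"}, "younger"),
]


def train(examples):
    """Count how often each trait appears within each age group."""
    label_counts = Counter(label for _, label in examples)
    trait_counts = defaultdict(Counter)
    for traits, label in examples:
        trait_counts[label].update(traits)
    vocabulary = {trait for traits, _ in examples for trait in traits}
    return label_counts, trait_counts, vocabulary


def predict(traits, model):
    """Return a probability for each age group (Bernoulli naive Bayes)."""
    label_counts, trait_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        log_p = math.log(n / total)  # prior for the group
        for trait in vocabulary:
            # Laplace smoothing keeps unseen traits from zeroing the product.
            p = (trait_counts[label][trait] + 1) / (n + 2)
            log_p += math.log(p if trait in traits else 1 - p)
        log_scores[label] = log_p
    # Convert log scores into probabilities that sum to one.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


model = train(TRAINING)
# A person with wrinkles is judged "older", whether the wrinkles come
# from age or from years spent in the sun; the model cannot tell.
print(predict({"wrinkles"}, model))
```

In this toy setting, a single trait is enough to push the estimate strongly toward the "older" group, which is the sun-exposure problem in miniature: the classifier sees the wrinkle but not the life that produced it.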
Often, we look to accuracy as a measure of how effective a new program or software is, rather than looking at how complete a picture it provides us of the subject. Instead of investigating how the software works in a larger context, we take a 96% accuracy as acceptable, as in the case of Beth Israel Hospital. The issue is that technology is often privileged over the human because it can handle more information and be more consistent (Levy 2012). However, we often lose sight of the fact that these are programs built by humans. The smaller dataset that most of these machine learning algorithms use to establish the patterns they later identify is human generated. For instance, one early use of machine learning for sentiment analysis was developing a consistent scoring system for movie reviews like those on Rotten Tomatoes. The training dataset was coded by graduate computer science students, who assigned a positive or negative value to a set of 1,500 words (Pang, Lee, and Vaithyanathan 2002). The algorithm developed a pattern from this initial dataset to determine statistically how positive or negative a particular movie review was by scoring individual words and coming up with a score for the entire piece. Many of the words utilized are merely synonyms for great and bad, which fails to reflect the nuances of movie reviewers' language, where many words might not naturally fall into two categories. By failing to generate a complete lexicon, the algorithm leaves many words unvalued and ignored in the calculation, providing an incomplete picture of the review. Irony causes further issues with this type of dataset, in which there is a one-to-one relation between value and word that does not allow any one word to have multiple values. A movie review has the potential to be scored incorrectly whenever it contains irony or other linguistic ambiguity. Studies of sentiment analysis have taken up these concerns. 
One study attempted to score a sentence twice where there is a turn at the end of the phrase (from a mostly positive statement to a negative ending) (Wiebe, Wilson, and Cardie 2005). However, this scoring only seeks to address misreadings of positive or negative sentiment. Because it was assumed that ambivalence had no place in movie reviews, the algorithm was developed to assign only a positive or a negative value. Ignoring the third dimension of language, ambivalence, is the fundamental flaw in these types of algorithms.
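The word-by-word scoring described above can be illustrated with a short Python sketch. It is not the code from Pang, Lee, and Vaithyanathan (2002); the six-word lexicon, the function names, and the sample sentences are all hypothetical, chosen only to show how a one-to-one mapping between word and value ignores unlisted words, misreads irony, and forces a balanced sentence into one of two categories.

```python
import re

# Hypothetical miniature lexicon: each word carries exactly one value.
LEXICON = {
    "great": 1, "brilliant": 1, "moving": 1,
    "bad": -1, "dull": -1, "awful": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_review(text, lexicon=LEXICON):
    """Return (label, total, ignored) for a piece of text.

    Words missing from the lexicon contribute nothing, and the label is
    forced to be either positive or negative; a tie is counted as positive.
    """
    total = 0
    ignored = []
    for word in tokenize(text):
        if word in lexicon:
            total += lexicon[word]
        else:
            ignored.append(word)
    label = "positive" if total >= 0 else "negative"
    return label, total, ignored


# An ironic sentence is scored as praise.
print(score_review("Oh great, another brilliant sequel nobody asked for"))
# A balanced sentence has no category of its own.
print(score_review("Great cast, dull script"))
```

In the first example the two lexicon words add up to a positive score even though a human reader hears sarcasm, and most of the sentence is simply ignored. In the second, the positive and negative words cancel, yet the scheme still has to choose a side.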
Ambivalence can lead us to experience, which in turn can potentially result in affect. Neutral language provokes thought by not overstating its intent (Hayles-Gilpin 16); however, sentiment analysis by design provides words and sentences that overtly display their intent. Neutrality transcends mere opposed poles (negative and positive, bad and good) by resisting obvious interpretation. Ambivalence creates experiences (Dewey 86). The space of unknowing places us in an in-between state, not knowing whether the sentence is positive or negative, which can cause a visceral reaction: “Affect arises in the midst of in-between-ness: in the capacities to act and be acted upon” (Gregg and Seigworth 1). Neutral language lives in the in-between, especially when isolated within a single statement. An ambivalent sentence within a paragraph might well be mistaken for exposition; but on its own, its truth and meaning are open to question. When we bring a neutral sentence out of context, we release it from the boundaries of exposition and allow its words to generate ambivalence (Kutas and Hillyard 1980).
The machine learning algorithms provide us with out-of-context information, measuring cognitive decline through keystrokes on a keyboard or determining the age of a person from a single picture. But what if we looked at how those patients felt on the days they came in to do the study? They could keep a diary and compare how they felt to how they performed. These efforts might give a better sense of the affective context of these patients. Could utilizing neutrality within an out-of-context sentence provide a similar way for an algorithm to better capture these dimensions of human experience? Some computer scientists claim that it is impossible for a computer to identify affect:
It can be argued that analyzing attitude and affect in a text is an "NLP"-complete problem. The interpretation of opinion and affect depends on the audience, context, and world knowledge. Also, there is much yet to learn about the psychological and biological relationships between emotion and language. (Shanahan, Wiebe, and Qu xi)
However, these computer scientists often lack a theoretical background from outside computer science that might provide insights into affect and ambivalence. If we were to adjust the sentiment analysis algorithm to include neutral language, then we could potentially add these dimensions to its understanding of human emotion. Rather than looking at how accurate the algorithm can be, we can look at how to increase its understanding of human experience.
We often think of growing old in terms of retiring and moving to a home. Our conceptions of old-age are so uniform that we believe that we can predict what it looks and feels like to be old. One such example is Microsoft's "How-Old.Net." On this website, you can upload a picture, and the website will employ a machine learning algorithm to determine how old you are. In my experience, the site misjudged my age by six years. While this website is fun to use, it also privileges this idea of out-of-context prediction through "big data." The algorithm learns how to determine age by drawing on a database with millions of pictures of people with corresponding ages attached to them. By using all of this data, the program can make a fairly accurate guess at age. The "Human-Computer Interaction Markers" study utilized a similar algorithm to determine which patients were at risk for cognitive decline. As we’ve seen, the issue with this type of algorithm is that it does not capture a wider understanding of human experience. The rest of this paper discusses an attempt to extend this approach. It asks: what if we were to take statements that deal with old-age and run them through a sentiment analysis to produce ambivalent statements? Could we essentially teach the computer to understand old-age better? In an experiment very similar to a Turing test, computer-generated statements and human-made quotations were compared to explore how close the computer could get to producing affect and essentially conveying thoughts about old-age.
To compare how a computer performs against a human in understanding the experience of old age, five quotations from each entity (computer and human) were generated through different means and then evaluated by different groups. A similar study comparing human and computer identification, by Chellappa, Wilson, and Sirohey in 1995, utilized a set of nine different questions for each subject (computer and human). In the current study, two quotes were generated randomly from the corpus and added to the survey to act as a control: "There's always Something there to Remind me" and “Always watch the news before bed.” Out of the mining and sentiment analysis, a survey was produced that included twelve quotations in random order. Participants were asked to circle the quotes that generated any feelings. Different groups were surveyed to get a large sample: a first group of thirty 19-35 year olds on April 7th, 2016 (at Midtown 501-Apartments); a class of twelve college students on April 14th, 2016 (English 695 class); and a set of ten older adults on April 20th, 2016 (Carol Woods). To consider different types of experiences from diverse cultures and ages, a vast corpus of material was utilized to generate the quotations.
The corpus of material that was utilized to generate quotations about old-age for the computer drew on a variety of open-source websites, including Early English Books Online (EEBO), Twitter, and Project Gutenberg. The corpus contained over 10,000 books and 5,000 tweets so that there was plenty of material to find related to old-age. The hashtags "#oldage" and "#geriatric" were mined exclusively in Twitter because they generated the most material. By comparison, the material used to produce the personal quotes comes from books that I have read about old-age and that I found particularly moving.
To efficiently process the material, the books from EEBO and Project Gutenberg were text mined for sentences that included words that had to do with old-age or aging. These terms were found by text mining the New York Times website, Google Books, and PubMed abstracts. The following graphs were generated to find frequently utilized words relating to old-age:
[Figure: PubMed. Information mined through the Ruby script text.rb and visualized with the R script Bio.R.]
[Figure: Google Books, mined through the N-gram website: https://books.google.com/ngrams]
[Figure: Graph generated through the New York Times API: http://developer.nytimes.com/docs]
These eight terms (mature, antiquated, declined, ripe, aged, ancient, geriatric, and elderly) were used to text mine the corpus through the script context.py. Combining the resulting sentences with the tweets created a curated corpus of material relating directly to old-age. A machine learning algorithm (basicsentimentanalysis.py) was utilized to determine sentiment. This algorithm was trained using over 25,000 tweets that were coded by a few computer science students who determined the sentiment of all of the words included in the tweets. They scored each word as either positive or negative to a particular degree. The algorithm was then modified to find ambivalence by averaging out the negative and positive words in the sentence to a value close to zero (sentiment_analysis_mod_new.py and sentimental_tweets.rb). After this modification, the algorithm was fed the old-age corpus in order to identify neutral sentiment. Ten of the corresponding quotes were then randomly selected to be a part of the survey.
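The two steps described above, filtering sentences by the eight old-age terms and then keeping those whose word scores average out near zero, can be sketched in a few lines of Python. This is not the content of context.py or sentiment_analysis_mod_new.py, which are not reproduced here. The graded word scores, the tolerance of 0.15, the function names, and the requirement that a sentence contain at least one positive and one negative word are all assumptions made for illustration; only the eight search terms come from the text.

```python
import re

# The eight old-age terms named in the text.
AGE_TERMS = {
    "mature", "antiquated", "declined", "ripe",
    "aged", "ancient", "geriatric", "elderly",
}

# Hypothetical graded lexicon: positive or negative "to a particular degree".
WORD_SCORES = {
    "wise": 0.7, "good": 0.6, "love": 0.8,
    "frail": -0.6, "lonely": -0.7, "pain": -0.8,
}


def words(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def mentions_old_age(sentence):
    """True if the sentence contains any of the eight search terms."""
    return any(word in AGE_TERMS for word in words(sentence))


def mean_sentiment(sentence):
    """Average the scores of the words found in the lexicon, or None."""
    scores = [WORD_SCORES[w] for w in words(sentence) if w in WORD_SCORES]
    if not scores:
        return None
    return sum(scores) / len(scores)


def is_ambivalent(sentence, tolerance=0.15):
    """True if positive and negative words roughly cancel each other out.

    Assumption: a sentence needs at least one positive and one negative
    word, so that neutrality reflects tension rather than emptiness.
    """
    scores = [WORD_SCORES[w] for w in words(sentence) if w in WORD_SCORES]
    has_tension = any(s > 0 for s in scores) and any(s < 0 for s in scores)
    return has_tension and abs(sum(scores) / len(scores)) <= tolerance


def select_quotes(sentences):
    """Keep sentences that mention old age and read as ambivalent."""
    return [s for s in sentences if mentions_old_age(s) and is_ambivalent(s)]


candidates = [
    "The elderly man was wise but frail",
    "The aged dog was in pain",
    "A lonely walk in good weather",
]
print(select_quotes(candidates))
```

Only the first candidate survives both filters in this toy run: the second mentions old age but is plainly negative, and the third is balanced but never mentions old age. A different tolerance or lexicon would change which sentences pass, which is exactly where the human coding decisions enter.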
The 52 participants looked at a total of 1,040 quotations and chose 291 quotations that evoked some affective response. Out of those 291 quotes, they chose 184 quotations from the human-generated set, 101 from the computer-generated set, and 6 from the control group. The human-generated quotations made up 63% of the chosen quotes. However, this did not offer a decisive selection preference of human over machine, as it was a slim majority. It indicates that additional work is needed to improve the algorithm, but the experiment was successful in generating quotes that provided a moving experience for those reading them, demonstrating that modifying the algorithm can provide a new dimension of information. The most popular computer-generated quote, which produced 42 responses, was “We don’t stop playing because we grow old; we grow old because we stop playing.” This particular quote came from the Twitter corpus. Concerning ambivalence, this sentence is the perfect example of the tension between negative and positive words, between “don’t” and “play.” It is not an expositional statement but was still determined to be neutral by the algorithm.
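The proportions reported above can be checked with a few lines of Python. The counts are the ones given in the text; the variable and function names are only illustrative.

```python
# Selection counts as reported in the text.
selections = {"human": 184, "computer": 101, "control": 6}


def shares(counts):
    """Return each source's share of all selected quotations."""
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}


total = sum(selections.values())
for source, share in shares(selections).items():
    print(f"{source}: {selections[source]} of {total} ({share:.0%})")
```

Run as written, this reports 63% for the human-generated quotations, 35% for the computer-generated ones, and 2% for the control quotes, out of 291 selections in all.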
This experiment in adapting a machine learning algorithm has demonstrated a novel approach to investigating the ways in which we can modify these programs in order to obtain a clearer picture of human experience. The technology often obscures our means of generating knowledge. When one pries open the ways in which algorithms generate information, one often finds that behind all the math and statistics, there is little accounting for affect, experience, and context. Context is important; it is what can bring us back to the human in a posthuman world of computation. We must comprehend how computers provide us with knowledge, specifically when it comes to the experience of old-age and medicine. Without intervening to ensure that computer-generated diagnoses consider contexts, experiences, and affect, we risk forgetting the human in medicine.
Please Circle the Quotes that made you feel something:
1. “Daniel and Christopher remain friends long into old age”
2. “We don’t stop playing because we grow old; we grow old because we stop playing”
3. “Like water, quality seeks its own level”
4. “Why do old men wake so early? Is it to have one longer day?”
5. “Final perseverance is the doctrine that wins the eternal victory in small things as in great”
6. “There’s always Something there to remind me”
7. “I was just thinking that I am not very many years old, but that I am a century wide.”
8. “So ive spent literally all day decorating. I finally sit down with a glass of bubbles, deep heat and a hot water bottle on my back.”
9. “Bought an MPV, wearing shoes instead of converse, eating less meat, enjoying folk music; I’ve turned into my dad”
10. “We had a good run, and now it’s over; what’s wrong with that?”
11. “Always watch the news before bed”
12. “the race is long - to finish first, first you must finish.”
Burrows, J. Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxfordshire: Clarendon Press, 1987. Print.
Chellappa, Rama, Charles L. Wilson, and Saad Sirohey. "Human and machine recognition of faces: A survey." Proceedings of the IEEE 83.5 (1995): 705-741.
Dewey, John. Art as experience. Penguin, 2005.
Drucker, Johanna. "Humanistic theory and digital scholarship." Debates in the digital humanities (2012): 85-95.
Garey, Michael R., David S. Johnson, and Larry Stockmeyer. "Some simplified NP-complete graph problems." Theoretical computer science 1.3 (1976): 237-267.
Gregg, Melissa, and Gregory J. Seigworth. “An Inventory of Shimmers.” The Affect Theory Reader. Durham: Duke University Press, 2010. Print.
Hayles, N. Katherine. How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics. Chicago, Ill.: U of Chicago, 1999. Print.
Hockey, Susan M. Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford University Press, 2000. Print.
Jänicke, Stefan, et al. "On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges." (2015).
Kremer, Ilan, Yishay Mansour, and Motty Perry. "Implementing the “Wisdom of the Crowd”." Journal of Political Economy 122.5 (2014): 988-1012.
Kutas, Marta, and Steven A. Hillyard. "Reading senseless sentences: Brain potentials reflect semantic incongruity." Science 207.4427 (1980): 203-205.
Levy, Frank, and Richard J. Murnane. The new division of labor: How computers are creating the next job market. Princeton University Press, 2012.
Moor, James H. "An analysis of the Turing test." Philosophical Studies 30.4 (1976): 249-257.
Murphy, Kevin P. "Naive bayes classifiers." University of British Columbia (2006).
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002.
Ramsay, Stephen. "Special Section: Reconceiving Text Analysis Toward an Algorithmic Criticism." Literary and Linguistic Computing 18.2 (2003): 167-174.
Shanahan, James G., Janyce Wiebe, and Yan Qu. Computing Attitude and Affect in Text: Theory and Applications. 2006. Print. Information Retrieval Ser.; 20.
Wiebe, Janyce, Theresa Wilson, and Claire Cardie. "Annotating expressions of opinions and emotions in language." Language resources and evaluation 39.2-3 (2005): 165-210.
Problems of this kind are often compared to "NP-complete" problems, for which no efficient general method of solution is known. To know more about this subject, refer to Garey, Johnson, and Stockmeyer (1976).
“Beth Israel Deaconess Medical Center Has an E.R. ‘Supercomputer.’” Boston Magazine, October 30th, 2015 (http://www.bostonmagazine.com/health/blog/2015/10/30/beth-israel-superco...).
A Turing test is a way to verify whether a person can determine if they are talking to a computer or a human. For more information, please see Moor, James H. "An analysis of the Turing test." Philosophical Studies 30.4 (1976): 249-257.
The texts utilized were Hemingway's The Old Man and the Sea (1952), Umberto Eco’s The Name of the Rose (1980), Muriel Spark’s Memento Mori (1959), Paul Harding’s Tinkers (2009), and Garth Stein’s The Art of Racing in the Rain (2008).