New Project: A Script for Distant Reading

Hello HASTACers! I'm delighted to now be able to follow the blogs of all the new HASTAC Scholars. I wanted to share a very tiny project I've just started, both to get my feet wet in the world of information reduction/distant reading and to solicit your advice. I'm teaching a composition course this fall, and during our discussion of methods of "close reading" we have been using a method called, amusingly enough, "The Method," from David Rosenwasser and Jill Stephen's Writing Analytically. "The Method" requires noticing repetitions, near-repetitions (called "strands"), opposite pairs of terms, and words or images that do not fit. From these data the reader/writer then begins the process of interpretation. After going through this slow process several times, I thought I would write a quick Perl script to help me locate repetitions and similar words. I've only run my script on a few small texts, but the results seem interesting. The next steps involve improving the logic behind the collection of strands and running the script on much larger texts.

Useful? Any thoughts on how best to read from a "distance"?
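
As a rough illustration of the repetition-counting step, here is a minimal sketch in Python. It is not the Perl script itself: the stop-word set CONNECTIVES is a hypothetical stand-in for the connectives table, and the function name top_repetitions and the tokenizing pattern are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the connectives table (words to ignore).
CONNECTIVES = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is", "it", "that"}

def top_repetitions(text, n=20, min_count=2):
    """Return up to n (word, count) pairs for non-connective words
    that occur at least min_count times, most frequent first."""
    # Lowercase the text and keep alphabetic words, allowing apostrophes ("wind's").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in CONNECTIVES)
    return [(w, c) for w, c in counts.most_common(n) if c >= min_count]

# Two lines from The Waste Land, used as a tiny sample input.
sample = "Here is no water but only rock. Rock and no water and the sandy road."
for word, count in top_repetitions(sample):
    print(f"{count:4d} {word}")
```

Counting words is the easy part; most of the interesting decisions are in what goes into the list of ignored words.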


Here is some (truncated) sample output from T.S. Eliot's The Waste Land:

[jed@moonunit] > Waste_Land.txt
reading connectives table...
reading dictionary...
File: Waste_Land.txt
Date: 02/11/2009
Lines: 494
Word Count: 1577

** Top twenty repetitions: **
  16 water
  10 nothing
  10 rock
  10 o
   9 dead
   8 mountains
   8 shall
   8 jug
   7 upon
   7 eyes
   6 dry
   6 city
   6 red
   6 sound
   6 goodnight
   6 night
   5 wind
   5 winter
   5 rain

** Strands: **
give: engage, gives, heap, known, leave, pass, say, shell, shower, snow, surrender, tell, voice, whisper
think: bear, consider, event, expect, feel, found, free, guess, knock, pass, play, sense, thinking, together
dry: alive, arid, brush, cracked, desert, drying, dusty, fire, high, keep, sandy, smoke, sun
whisper: blast, brush, cry, feel, kiss, murmur, peal, say, snarl, stroke, tell, voice, whispers
clutch: agony, bear, boat, bore, car, grow, heap, pocket, stay, together, turning, wreck
nice: aware, beneficent, careful, ears, elegant, famous, kind, narrow, pleasant, propitious, welcome
walk: beat, foot, forced, ground, house, limp, mince, tour, walked, walking, walks
sound: elegant, famous, fit, kind, mouth, nice, reach, say, tell, voice, whisper
found: bear, chair, event, free, ground, knock, pack, pass, ring, throne, together
pleasant: awful, beneficent, bright, flushed, glad, golden, high, kind, nice, propitious, welcome
fell: awful, barbarous, brain, cast, cock, desert, flat, floor, knock, pick, table
wind: bear, beat, cast, change, gammon, knock, rubbish, tent, wind's, winding
pass: ahead, along, engage, known, leave, passed, play, stand, surrender, tell
feel: brush, consider, encounter, expect, guess, kiss, pass, sense, stroke, whisper
tell: count, known, leave, pass, plain, say, stand, telling, voice, whisper
pocket: bear, beauty, grow, house, money, pool, shell, sink, stay, together
say: bear, ipse, record, stand, swear, tell, testimony, voice, whisper, yes
bar: alley, blank, blind, count, keep, knock, leave, marble, pass, pick
foot: beat, brush, dog, drift, forced, limp, mince, sail, swing, walk
rattle: bear, blast, chatter, clatter, crowd, cry, dig, fire, peal, rattled
beat: bear, beating, beats, cast, change, foot, pound, swing, tour, walk
sweet: baby, beneficent, dear, golden, kind, lover, nice, pleasant, propitious, welcome
desert: arid, dust, dusty, fell, flat, high, rat, sandy, table, walk
stand: bear, bill, house, keep, lie, pass, reach, sit, stay
meet: connect, encounter, face, feel, narrow, pass, stand, stream, together
kind: beneficent, blood, elegant, famous, human, nice, pleasant, propitious, welcome
peal: blast, clatter, cry, murmur, rattle, ring, snarl, tolling, whisper
humble: bear, beat, keep, knock, least, lowest, poor, third, walk
hand: cards, doing, hands, house, queen, shore, straight, stroke, tour
sun: brush, dawn, fire, fortnight, month, moon, smoke, sun's, sunlight
stay: coffee, house, keep, relief, sit, spoke, stand, staying, stays
brush: feel, fire, foot, kiss, pick, smoke, stroke, sun, whisper
noise: blast, blind, bloom, clatter, cry, drift, peal, rattle, snow





The first thing I would do is look around to see who else is doing similar work. It's too difficult to reinvent the wheel. There's already a lot of work on text encoding, and some of the things you want to find (like opposites and incongruities) require conceptual encoding of a text, so I bet somebody is already working on it. Either way, I'd add the ability to analyze multiple texts together rather than just one at a time. Also, rather than rolling your own language code, why not look at these Perl modules? Then you can find stems, pluralize, conjugate verbs, etc. without having to solve all the special cases yourself. Finally, for your strand analysis section, I think a lookup table would be more effective and easier to modify than a long string of if statements, which will quickly become difficult to read and maintain as you add more capabilities.
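
To illustrate the lookup-table idea, here is a minimal sketch in Python (the same shape works with Perl hashes). Everything in it is made up for illustration: the table contents, the names RELATED and find_strands, and the two tiny word lists. It also pools several texts at once, as suggested above.

```python
# Hypothetical lookup table: head word -> set of related words.
# In practice this would be loaded from a dictionary or thesaurus file.
RELATED = {
    "dry": {"arid", "dusty", "sandy", "cracked", "desert"},
    "whisper": {"murmur", "cry", "voice", "say", "tell"},
    "walk": {"foot", "limp", "tour", "ground"},
}

def find_strands(words_by_text, related=RELATED):
    """Map each head word that occurs in the texts to the related
    words that also occur somewhere in the same set of texts."""
    # Pool the vocabulary of every text so strands can span texts.
    vocabulary = set()
    for words in words_by_text.values():
        vocabulary.update(words)

    strands = {}
    for head, candidates in related.items():
        if head in vocabulary:
            found = sorted(candidates & vocabulary)
            if found:
                strands[head] = found
    return strands

texts = {
    "text_a": ["dry", "sandy", "road", "whisper"],
    "text_b": ["arid", "plain", "murmur", "voice"],
}
for head, members in find_strands(texts).items():
    print(f"{head}: {', '.join(members)}")
```

Adding a new strand then means adding a row to the table rather than another branch to the code.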


Thank you, Michael. Good pointers and advice. Lingua::EN* looks very useful. As far as reinventing the wheel goes, I'm not necessarily opposed. I've thus far been unable to find any good, easy-to-use methods for doing what I'm looking for.


I recently used MALLET's topic modeling tools to do a similar kind of analysis on a collection of about a hundred late-c19 popular travel books. I'm not familiar with "The Method", but I've found topic models useful in finding patterns and relations in large corpora: you give MALLET a set of documents (in my case each "document" was a paragraph from a travel book), and it returns a list of "topics" (which are just distributions over words) and associations between topics and documents. So in my case one of the topics it discovered was a "gambling" topic that contained the following words:

game casinos tables play hands money slot games roulette table played hand playing chips learn skill

I can then ask for the 100 or 1000 documents that are most strongly associated with this topic, or I can get a sense of similarity between two documents by comparing their topics. The handy thing is that the topics provide abstraction away from vocabulary: I can find passages related to gambling even if they don't happen to contain the specific words "gamble", "bet", etc.
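
To make those two operations concrete, here is a rough sketch in plain Python. It does not use MALLET's own interface: the document ids and topic proportions are invented, and the function names top_documents and cosine_similarity are assumptions. Cosine similarity is just one reasonable way to compare two documents' topic proportions.

```python
import math

# Hypothetical document-topic proportions of the kind a topic model reports.
# Each list gives one document's weight on topics 0, 1 and 2.
doc_topics = {
    "para_001": [0.70, 0.20, 0.10],
    "para_002": [0.65, 0.05, 0.30],
    "para_003": [0.05, 0.90, 0.05],
}

def top_documents(doc_topics, topic, n=100):
    """Return the ids of the n documents with the highest weight on a topic."""
    ranked = sorted(doc_topics, key=lambda d: doc_topics[d][topic], reverse=True)
    return ranked[:n]

def cosine_similarity(p, q):
    """Cosine of the angle between two topic-weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Documents most strongly associated with topic 0.
print(top_documents(doc_topics, topic=0, n=2))

# Similarity between two documents, based on topics rather than shared words.
print(round(cosine_similarity(doc_topics["para_001"], doc_topics["para_002"]), 3))
```

Because the comparison runs over topic weights, two passages can score as similar without sharing any particular word.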

Topic modeling doesn't require any preprocessing in the way of lemmatizing, part-of-speech tagging, etc., but if you are shopping around for toolkits to do those kinds of things, I'd recommend MorphAdorner (although I agree that rolling your own can be more fun).