Blog Post

799 Million Moby Dicks

Jim Gray, the database software researcher who disappeared at sea in 2007, predicted the paradigm shift in science that would arise from the massive amounts of data that can now be collected and must be analyzed in all aspects of the material, social, and cultural world.  Gray's colleagues at Microsoft celebrate his work in distributed computing systems in a new collection dedicated to his memory, The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle.


New scientific tools that are part sensor and part computer take in unfathomable amounts of information.  These include the Australian Square Kilometre Array of radio telescopes, CERN's Large Hadron Collider, and the Pan-STARRS telescopes, which generate several petabytes of information every day.  So how big is a petabyte of information?   According to John Markoff in a NY Times article today, "A Deluge of Data Shapes a New Era in Computing," a petabyte of data is the rough equivalent of 799 million copies of Moby Dick.   Put that on your reading list, fellow scholars!
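The Moby Dick equivalence is easy to sanity-check with back-of-the-envelope arithmetic. This little sketch assumes a plain-text copy of the novel runs about 1.25 MB (Project Gutenberg's edition is in that neighborhood) and uses decimal (SI) units for the petabyte:

```python
# Back-of-envelope check of the Moby Dick equivalence.
# Assumption: one plain-text copy of Moby Dick is ~1.25 MB.

PETABYTE = 10**15           # bytes, decimal (SI) units
MOBY_DICK_BYTES = 1.25e6    # ~1.25 MB per copy (assumed size)

copies = PETABYTE / MOBY_DICK_BYTES
print(f"{copies/1e6:.0f} million copies")   # prints: 800 million copies
```

Which lands right around Markoff's figure of 799 million; the exact count just depends on which edition of the novel you weigh.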


Of course, sorting through the daily petabytes requires new tools.   And they aren't all expensive.   Cheap clusters of computers can manage and process data at remarkable speeds.   Until recently, for example, you could run Linux on your PlayStation 3 and, for less than $500, you could basically make a cluster farm that could process petabytes like a supercomputer.   Other tools help you search, sort, and analyze.
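The principle behind those cheap clusters is simple: split a big job into chunks and let many processors each take a slice. This toy sketch is illustrative only (real petabyte-scale work uses frameworks like MapReduce/Hadoop across many machines); it just divides a text into chunks, counts words in each chunk in parallel using Python's standard library, and sums the partial results:

```python
# Toy version of the cluster idea: divide the data, process the
# pieces in parallel, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def word_count(chunk):
    """Count words in one slice of text -- the per-node work."""
    return len(chunk.split())

text = "Call me Ishmael. " * 10_000                       # stand-in corpus
chunks = [text[i:i + 4096] for i in range(0, len(text), 4096)]

with ThreadPoolExecutor(max_workers=4) as pool:           # 4 workers ~ 4 nodes
    total = sum(pool.map(word_count, chunks))             # combine partials

print(total)   # close to 30,000, off only where a chunk boundary splits a word
```

The same divide-and-combine shape scales from four threads on a laptop to thousands of nodes in a data center, which is why commodity hardware can keep up with petabytes at all.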


And data isn't just for scientists any more.   Indeed, the plethora of data changes knowledge on every level and it changes everyday life.   As I and many others in HASTAC have been saying since the beginning of our run in this world, if humanists keep dismissing "data" and "evidence" as mere "positivism," we miss one of the great opportunities of our era.   There really isn't such a thing as "data crunching" in the end.  Data isn't just "crunched" (what does that mean?) but has to be interpreted, understood, put into context, analyzed alongside other data, and in many other ways put through all the paces that humanists are expert at.   The divide of "theory" versus "practice," or "the theoretical" versus "the empirical," has long since been shown to be bankrupt.

How much data comes from 100 million Facebook users updating their status every day?   Who are those users?  How do they see this Web of a world they are co-creating but that is also constantly available for exploitation?  How do identitarian matters of race or gender or sexual orientation play out in virtual spaces such as Facebook?   How does performativity work online?   What about the concept of a "self" when it is clear that we can enter and leave the Web in many guises, constantly overlapping and yet distinctive?


And what is the relationship of data to communication?  Before I exit this blog, I will use one of the widgets to share it with my Twitter network and my Facebook friends.   What does that widgety activity mean?   How does it change the cycles of authorship, production, consumption, publishing, and distribution?  What does it do for privacy?  How does it shape the public sphere?   Information isn't static.  It is Webby, as Ruby reminds me: each thing I find is part of a web of information connected to each other thing that you find, and that we can mash up and mix together.   Data is social.


So much data, so little time.   Humanities in a Digital Age are vital, urgent.  Better get busy.  799 million Moby Dicks await us.




I really enjoyed this post. I especially loved your statement:

"Data isn't just "crunched" (what does that mean?) but has to be interpreted, understood, put into context, analyzed alongside other data, and in many other ways put through all the paces that humanists are expert at.   The divide of "theory" versus "practice," or "the theoretical" versus "the empirical," has long since been shown to be bankrupt."

Probably because I totally agree!! To understand that even "positivist" science is dependent on human interpretation, and that all kinds of research paradigms in fact ask different questions and form a dialogue around a phenomenon of study... these are important points for us stuffy academics who sometimes forget that our own research methodologies are not infallible.

Your point about data is also very timely.  In my own research I am following the Facebook activity of about 250 urban youth, downloading their posts only once per week. Even so, that is generating hundreds of thousands of data points!! This IS an exciting time for research.



Thanks!  And what a great project.   My frustration is through the roof with humanists who don't understand how vital we are to this world we live and teach in. . .    Until WE understand this, our institutions will not.  


I think what you're highlighting, Cathy, is the critical epiphenomenon of modern data creation and data availability.  It will all* be lost, if extrapolation of present data growth rates holds, and so what we choose to focus on, analyze, "crunch," or build our own models with will always suffer from that critical limitation.  It's the ultimate defeat of the quantitative by the qualitative.  Here's what I mean: we're producing a jillion data points a day right now, so that for every 799 Moby Dicks there will be 79,999 a day the next time you look, and any automated attempt to gather, sort, and store such data will itself be a mediation of the data, which will grow more mediated as the data continues its growth.  Ultimately, even possibly right now, the scholar's decision to analyze data will be so seemingly arbitrary in its selection that it will, I think, fold back into being a humanistic, qualitative choice.

*All, meaning most, but to such a degree that you might as well refer to it as all.  Even if every minute's changes of Facebook, World of Warcraft and Google Earth are stored (if the storage capacities are possible) the resultant store would be so large that it would be fundamentally impossible for a human being to explore it.  Therefore, the data itself will never be addressed, except in infinitesimally small sample sizes, and instead the data as a body will be dealt with through a host of mediators.  We have no scholars of Global Literature, because a scholar has not the capacity to explore the global corpus (due to size, language and access restrictions) but such a theoretical scholar would have an easy time of things, quantitatively, when compared to a putative Facebook scholar, who would be dealing with an even larger corpus equally linguistically enshrouded (but in this case both by human and software languages).