
HASTAC Conference Notes: Keynote by Josh Greenberg


Josh Greenberg, Alfred P. Sloan Foundation

“Data, Code, and Research at Scale”


His slides have all sorts of great images, and he says they will be online, one presumes at


An epistemographer tries to understand the history of tools that we use (he likes to build and do rather than just observe)

Disclaimer: this is not a speech about the Sloan Foundation; these are his thoughts alone


Introduction by Dan Cohen (George Mason) (his former teacher and eventual boss)

            Worked with Zotero at George Mason

            Got a lot of archival content on website for NY Public Library

            Now with Sloan Foundation (head of digital technology program)

            His book: From Betamax to Blockbuster, on how video cassettes changed an industry and a culture



Research at Scale

  • This idea is getting more traction
    • Telescopes let you see far, microscopes let you see small; now we are talking about a macroscope, which lets you see big and complex
  • Katy Borner wants to build macroscopes (and wants to empower domain scientists (researchers) to assemble these macroscopes themselves)
    • Tools that help you think big
    • Lots of people doing this kind of work with Twitter (e.g. the Twitter firehose, which allows real-time study of trends)
    • OKCupid has a blog for its data scientists, and they publish fascinating bits of research (neat example here:
      • e.g. 35K couples who met on OKCupid; they mine that data to help understand human behavior
        • e.g. don’t ask “will you sleep with someone on the first date?” Instead ask, “do you like the taste of beer?”
  • So you can make a big ol’ pile of data
    • Assumptions now go along with use of the term “big data”
      • What it is and how it should be used
  • Sloan Foundation
    • They really like data (e.g. census of marine life, map of the stars, MoBeDAC)
    • All become base data sets for others to study
  • Often spoken of as an impending data deluge: we won’t know how to deal with it
  • Another term: data curation
    • Not just about archiving.  Also have to decide what to throw away, for example

What about code?

  • Deeply bound up with data through its production and use (i.e. we usually focus on the beans in the coffee grinder rather than the grinder)
  • Google Books Ngram Viewer
    • What if you want to make claims based on this tool?
    • Can download the data
    • Article in Science hailed culturomics
      • I.e. from people, to the buildings where the books were stored, to the scans of books into digital files, placed within a corpus, and then you enter keywords for frequency
        • Code a key part of all of that
          • OCR code extracts characters
          • Cleanup code that fixes characters from OCR
          • Other code that decides which characters are clean and can/should be analyzed
      • There’s a provenance chain to understand how we produce and use data
  • Who does the work?
    • Data science—another new term/concept
      • Some envision it at the intersection of applied math and engineering, working with quantitative data at scale
      • Others see applied math and engineering coalescing into writing
        • Ability to tell stories with data
        • There’s a big demand for people that can do this (tell stories with visualization of data—using humanistic skills to study data)
        • #alt-ac?
  • Bunch of innovative projects
    • Zooniverse—Galaxy Zoo project
      • Every galaxy got touched 10 times by researchers
    • Old Weather based on ship logs
      • Pulling out climate data and transcribing historical records
    • What’s on the Menu (NY Public Library has historical records)
      • Hard to OCR
      • So a community of people transcribed them instead
    • Wikipedia
      • More active knowledge construction
    • What are the mechanisms of visibility for people who jump from data collection/cleanup to analysis?
  • How we produce knowledge: epistemology
    • Two dominant modes of production of knowledge (built into our major institutions of higher learning)
      • Science (empirical)
      • Hermeneutic (interpretative)
        • Then you have Infochimps
          • Wants to be a data marketplace (help circulate data)
          • Is there an epistemology of big data? (modes of computational research)
            • What logic underlies it?
          • Consider The Fourth Paradigm: data-intensive discovery (from Microsoft Research)
          • Or “Screwmeneutics”
            • Open ended inquiry and exploration—browsing mode
            • (browsing vs. search (targeted))
  • What about trust?
    • Need more in this space
    • Notion of reproducibility (have to be able to reproduce empirical experiments)
    • Analogy
    • Empirical falsifiability : methods (trust comes from this)
    • Hermeneutic inquiry: provenance (where info came from lends trust)
      • I.e. issue of citation
    • Trust depends on systems of institutions in which knowledge is developed
      • Our means of dissemination are out of sync with the methods of scholarly production
        • How you publish work severs the chain of provenance
        • There is not an unbroken chain when working with big data
    • What if we wrote scholarship like code?
      • Version control—everyone gets there by falling flat first and then realizing it's a good idea
        • Lets you collaborate and jointly author code
        • Code is never completed—always evolving
      • Tagged release
        • Every so often, you say it is good enough and flag it as a release (hence version 1, version 2, etc.)
        • As opposed to notion of publication
      • Not everyone gets to edit
        • Bug trackers do surface problems, and then editors work with them in an open system
      • Forking
        • If doing all this in system of open code, you can fork the project when there are arguments about how to proceed
          • Link back to common ancestry remains (provenance)
  • The very technology that enables research at scale potentially enables new modes of dissemination
  • Research at Scale
    • Research projects that work at scale
    • There’s an interesting argument to be made about the broad landscape of research
      • Thinking about all research at scale
        • i.e. disaggregation of scholarly research  (from journals to new channels)
      • Analogy

        • Humanities : blogs
        • Social sciences : SSRN (preprint)
        • Sciences : PLoS ONE (rapid publication)
        • Need more conversations across these

  • Macroscopic methods of discovery, assessing impact
    • Focus on broad scholarly inquiry
    • In digital humanities—particular opportunities in this space
    • Often think of humanities as lagging behind
      • But digital humanities are providing a model for the sciences
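
An aside on the Ngram chain from the "What about code?" notes above (OCR code, cleanup code, code that decides what is clean, then keyword frequency): here is a minimal Python sketch of what carrying provenance through such a chain might look like. Everything in it is made up for illustration and is not how Google Books works: the `Record` class, the `fake_ocr` stand-in, the "1 read for l" cleanup rule, and the letters-only filter are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """A piece of text plus the list of steps that produced it."""
    text: str
    provenance: list = field(default_factory=list)

    def step(self, name, new_text):
        # Return a new record; the provenance chain is extended, never overwritten.
        return Record(new_text, self.provenance + [name])


def fake_ocr(page_image):
    # Stand-in for real OCR (hypothetical): the "image" is already a noisy string.
    return Record(page_image, ["scan"]).step("ocr", page_image)


def cleanup(record):
    # Toy cleanup rule (an assumption): fix the digit "1" misread for the letter "l".
    return record.step("cleanup", record.text.replace("1", "l"))


def is_clean(record):
    # Toy filter (an assumption): keep pages made only of letters and spaces.
    return all(ch.isalpha() or ch.isspace() for ch in record.text)


def frequency(records, keyword):
    # Count keyword occurrences across the pages that survived the filter.
    counts = Counter()
    for r in records:
        counts.update(r.text.lower().split())
    return counts[keyword]


pages = ["the who1e whale", "a wha1e of a tale", "wh@le ###"]
processed = [cleanup(fake_ocr(p)) for p in pages]
kept = [r for r in processed if is_clean(r)]
whale_count = frequency(kept, "whale")

for r in kept:
    print(r.text, "<-", " > ".join(r.provenance))
print("whale:", whale_count)
```

The point of the sketch is only that the final count depends on every step before it: the third page is silently dropped by the filter, so a claim built on the frequency is also a claim about the cleanup and filtering code.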

Time for One Question

            Tara (McPherson, one presumes): she wants to know what his vision is for humanities and the arts

                        He is candid—knows less about arts as a domain

                        Different modes of knowledge production need to be at the table to discuss how to deal with big data


Data is the next big boom in the industry

                        They don’t just want coders.  They want people who tell stories with data.

                        How can the humanities position themselves to train others to do that work?



