
Is All Big Data ‘Messy’? What Questions Must Researchers Ask Before, During, and After Crunching the Numbers?

Lightning talk given at the National Data Service Consortium in Boulder, Colorado, on June 12, 2014

This talk brings the interdisciplinary perspective of the social sciences, humanities, and digital humanities to data science. It is a follow-up to our HASTAC May 28 "Big (and Messy) Data" workshop, part of a two-year NSF EAGER grant on data and cross-disciplinary collaboration and mentoring. A key concern from that workshop, one that needs to be applied to our National Data Service, is what my colleague and collaborator Richard Marciano has termed the "forensics" of understanding and interpreting big data. If we are going to provide a national data service for researchers, that service must include the questions any researcher, in any field, must pose in order to fully understand the biases, histories, and ambiguities of data, including the ways that inputs can distort outputs and the fact that all data requires interpretation and context.

Credits: Special thanks to Kaysi Holman and Jade Davis for brainstorming this talk with me, and to Jade for designing the striking slides.


1 comment

Nearly 37 years ago I worked with the Chicago Public Schools developing a desegregation plan. We discovered that the schools routinely used six - or more - different report cards, and that data were remarkably inconsistent over time as kids moved around the system. Eventually, the whole effort slid into what is now the University of Chicago's Consortium on Chicago School Research (CCSR), and the system adopted comparable report cards, attendance and on-time reports, and, eventually "common core" and testing standards.

In part because of the chaos we found in 1977, this "system" of today was a "reform." In part it is the economic engine for players like Pearson and the common core. And in part it became the foundation for tracking issues like attendance and, later, drop-out rates, since the largest part of the database concerned whether students got to school at all, or when, or for how long. Ironically, as that Consortium so well documented, attendance was a more significant "cause" (or, perhaps, correlate) of dropping out, and dropping out proved to be the single most consequential decision for the students who made it.

I cite this history since it's now painfully obvious that many of these same features have complicated "big data" on education, and that there's a need both for consistency - from the moment students enter the system to however long they stay, wherever that may be - and for clarity in purpose and implementation. Now that technology has caught up with and gone beyond the CCSR model, it's absolutely critical to build on what works rather than invent new metrics with no serious database foundation.

Incidentally, we also discovered that the Chicago data is now the largest, oldest, and most reliable base on which other analyses might be built. Creating new "stuff" while ignoring the patterns - of attendance, age, language, family, income, location, school, grade, and grades - as if college were a totally separate experience is both stupid and remarkably expensive. If the data from earlier educational stages doesn't exist, the patterns inferred later aren't really patterns.