A Deep Dive Into HASTAC's Big Data: Welcome Postdoctoral Fellow David Sparks!


If there is one thing HASTAC has in abundance, it is data: unanalyzed, unmined, clean, anonymized data. We now have political scientist David Sparks here, the postdoctoral fellow on our NSF EAGER grant, to lead an analysis of all this data and help us understand what we can learn about virtual mentoring, interdisciplinary collaboration, and a range of other topics. David will be studying the content, users, and structure of HASTAC while furthering his own research into social networks.


Our EAGER grant is entitled “Assessing the Impact of Technology-Aided Participation and Mentoring on Transformative Interdisciplinary Research: A Data-Based Study of the Incentives and Success of an Exemplar Academic Network.” David’s doctoral dissertation, completed at Duke University, focused on “Ideological Segregation: Partisanship, Heterogeneity, and Polarization in the United States.” He has substantial training in mathematical and visual social network analysis, and social network analysis is the central focus of his research agenda. For one project, he used trace data from approximately 4 million Twitter users and over 500 elites to analyze political preferences and ideological leanings. That kind of complex interconnection of data and social and political questions is what made David our top candidate for this postdoctoral position. He is particularly skilled in organizing, modeling, and visualizing multidimensional, large-N data.


Outside of his doctoral dissertation and course work, David worked for several years as a statistical consultant to the Boston Celtics. He is able to draw on multiple sources of multi-modal data and derive insights using a variety of methodological approaches, including computational linguistics, sentiment analysis, and topic modeling.


We don’t know what we’ll find from this search, but we know it will be rich and interesting. David’s insight, enthusiasm, and excitement over the “complexity” of HASTAC make us confident that we will emerge with some great information that will help us make HASTAC even better.


David co-authors a blog, in case you are interested in seeing his other work. On his blog, he shares his knowledge of best graphical practices. You can also find his work on a second blog. He’s a friendly, delightful new addition to the HASTAC central administrative and research team. We all look forward to what will emerge from his deep dive into HASTAC’s data. He’ll be presenting his initial results soon, at our HASTAC 2013 International Conference in Toronto.


Please join us in welcoming David Sparks to HASTAC.  



1 comment

You're right about your associate's graphics - they can show lots more than what most people can ask....

And that, in turn, requires intriguing data with which to begin. The richest vein of data I know in public education is from the Consortium on Chicago School Research, which itself has an interesting history. In the early 1970s, Chicago faced a serious issue of desegregating a system already over 70% African American. The state, led by Malcolm X's half-brother, wanted the district to develop policies that offered students serious experiences beyond the racial boundaries of a system frozen by racism.

This ultimately forced the system to disaggregate itself from a network of regional sub-districts, and to create, thereby, one of the richest metropolitan databases in history. I was on the initial team facing this challenge, and we found, for just one example, six different report cards, each with different criteria, and each with different definitions of quality and achievement. Along with different definitions of attendance (by sub-district) and graduation, testing, etc., this forced the system to create a common language for many, many previously incompatible terms.

Eventually, the Consortium evolved to house and analyze those data. The city was (then) the second largest district in the nation, and the largest single data-pool, since all others had sub-district definitions. That means that there are now 40 years of common definitions.

One of the more intriguing results was a study of dropout behavior that indicated that grade-retention (taking the same year or course more than once) had a 90% correlation with later dropout. Now, correlations are not causes, but when they're that high, they cause one to pause. And the data were simple: attendance, retention, and graduation. Since most grade retention was "caused" by poor attendance, the "solution" was early intervention. They didn't do it, but they could have, should have, and now do know the ways to prevent a huge ratio of dropouts.

It would be fascinating to track the changes over time, across schools, and to actually measure the impact of grade retention, and the cost-benefit of early intervention to reduce retention. It would also be intriguing to explore the actual costs of grade retention - in simple dollar terms - and see where the system wasted its money. Not because Chicago's particularly bad, by the way, but because their data provide a bellwether to identify tactics that could be used by other districts, cities, and states - like this issue of retention....