An interview with data artist Glauco Mantegari, a specialist in linked data who works with humanists on Mapping the Republic of Letters and on data and linked data projects at the Humanities + Design lab. Glauco talks about his experience finding and cleaning datasets for visualization and archiving.
How would you describe what you do? Are you a data scientist?
Partly, I am. I analyze, inspect, and transform data in order to find meaningful patterns. But the main difference between my work and a data scientist's is that mine is not focused on statistics in the formal sense. I feel closer to the definition of a data artist, a label for professionals who try to find meaningful patterns in datasets and communicate what they discover to a wide audience. Data visualization and interactive storytelling are important components that make it possible to communicate with non-specialists.
What kind of patterns are you looking for? How do you find them?
The project that I am working on right now is Mapping the Republic of Letters, a digital humanities project. I was looking for connections between people, places, and time in order to help explain Early Modern networks. I specifically investigated Voltaire's scholarly activity, starting from the data the project had on Voltaire's correspondence (see a related project by Bugei Nyaosi) from the Electronic Enlightenment project at Oxford University. I decided with Dan Edelstein, the PI for the Voltaire project, to expand this analysis using publication data. We analyzed the geographical distribution of Voltaire's publications: if there were significant clusters of publications in certain regions, we would compare those clusters to the correspondence data.
We focused on spatial and relational dynamics and how they change over time, which means we tried to see the data from various angles. Finding patterns and analyzing data is creative work: based on the dataset at hand, you need to figure out which approach could be used to integrate, analyze, and visualize the data. You also need to think about how datasets overlap and how one dataset (like correspondence information) throws a different light on another (like publication data). One dataset might give more precise temporal information, such as when someone moved from place to place, while another might give a better sense of exactly where the person was.
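The kind of dataset overlay Glauco describes can be sketched in a few lines of Python. The records and field names below are hypothetical, used only to illustrate clustering two datasets by place and checking where they overlap:

```python
from collections import Counter

# Hypothetical records standing in for the two datasets.
correspondence = [
    {"place": "Geneva", "year": 1760},
    {"place": "Geneva", "year": 1761},
    {"place": "Paris", "year": 1760},
]
publications = [
    {"place": "Geneva", "year": 1760},
    {"place": "Amsterdam", "year": 1761},
]

def cluster_by_place(records):
    """Count how many records fall in each place."""
    return Counter(r["place"] for r in records)

letters_by_place = cluster_by_place(correspondence)
pubs_by_place = cluster_by_place(publications)

# Places appearing in both datasets hint at overlapping activity
# worth studying more closely.
overlap = set(letters_by_place) & set(pubs_by_place)
print(sorted(overlap))
```

A real analysis would of course involve messier records and finer-grained time handling, but the principle of cross-checking one dataset against another is the same.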
Glauco at the Stanford Humanities Center, 2012. Photo by Nicole Coleman.
Your work is very technical. How can humanities scholars who are not data artists make use of data in their own research?
My work is technical insofar as you need to write specific scripts to, for example, retrieve and integrate data and translate it into different formats. But new tools and easy-to-learn programming languages are making these tasks much easier. Python, for example, is a language that many people outside computer science are learning because it is intuitive and powerful.
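Translating data between formats, one of the tasks mentioned above, takes only a few lines of Python with the standard library. The CSV snippet here is a made-up stand-in for a downloaded dataset:

```python
import csv
import io
import json

# A small CSV snippet standing in for a retrieved dataset (hypothetical).
raw = """name,place,year
Voltaire,Geneva,1760
Voltaire,Paris,1761
"""

# Parse the CSV into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(raw)))

# Translate the same records into JSON, another common exchange format.
print(json.dumps(rows, indent=2))
```

In practice the data would come from a file or a web API rather than an inline string, but the read-transform-write pattern is the same.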
There are also ready-to-use tools that make it possible to create visualizations with little technical expertise. On our team, Nicole Coleman has been developing tools of this kind so that scholars can visualize the ever larger amounts of data that are becoming available. When we visualize data, people without a technical background can gain insights and find interesting patterns in large datasets. They can also find gaps and anomalies that can be studied, and perhaps explained, using the more traditional methods that humanists have been applying for decades, or even centuries.
What do you think of the digital humanities?
The question of digital humanities is an epistemological one. I don't see digital technologies as a complete substitute for traditional methods. Even in domains such as artificial intelligence, there are many approaches that can help us answer questions through computation. But it is important to define the questions first.
I do hope that technical tools are used more by scholars in the humanities. The technology is changing rapidly. In the near future, we will have extremely powerful applications that do a lot of the analysis and visualization work for scholars. These applications will perform complex tasks that once required programming skills.
Using these new tools will change methodologies somewhat, as it has in other fields such as the social sciences. This is the epistemological effect of new technologies and of processing large quantities of data. But the meaning and the final interpretation of the data cannot be created by a computer alone. It is the scholar who will pose the questions and craft the interpretations, increasingly supported by new technologies in doing so.
I know that you have worked with datasets from the University of Oxford and the Bibliothèque Nationale de France a lot this year. How are cultural institutions and libraries using large datasets?
Big institutions are really pushing data collection and integration forward. The linked data movement is making data available and accessible to everyone for cultural reasons. Big institutions like university libraries are creating a cloud of interconnected datasets, which is a huge step. There has been a lot of research in the cultural heritage domain in recent years on the semantic web and linked data, for example by the Semantic Computing Research Group at Aalto University in Finland. I think that large integrated datasets are the future; this is the direction we are moving in. Systems like graph databases are only pushing integration further.
Linked data is the first step towards internationally integrated cultural datasets, and issues such as data quality and attribution are also being addressed.
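Linked data represents facts as subject-predicate-object triples, which is what lets datasets from different institutions interconnect. A minimal sketch of the idea in plain Python follows; the `ex:` identifiers are illustrative placeholders, not real URIs from any institution's dataset:

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
triples = [
    ("ex:Voltaire", "ex:wroteTo", "ex:Frederick_II"),
    ("ex:Voltaire", "ex:publishedIn", "ex:Geneva"),
    ("ex:Frederick_II", "ex:residedIn", "ex:Potsdam"),
]

def match(pattern):
    """Return triples matching a pattern; None acts as a wildcard."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything this tiny graph knows about Voltaire:
print(match(("ex:Voltaire", None, None)))
```

Real linked data systems use global URIs and query languages like SPARQL, so triples published by one library can be joined with triples published by another, but the underlying pattern-matching idea is the same.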
If you could recommend just two or three applications or tools for humanities scholars, what would you recommend?
There is no single answer to this question, because the choice of a tool depends on the aim of the research.
Nevertheless, I think that OpenRefine is today the most powerful and versatile tool for working on data with limited technical skills. It covers a wide range of common tasks such as data inspection, cleaning, transformation, and linking. Some of the humanities students at Stanford have been learning simple expressions in GREL, the language embedded in OpenRefine, and this has been very useful for filtering and transforming data. Furthermore, extensions and APIs are being developed, making it possible to perform additional operations on the data and to script OpenRefine.
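The kinds of transformations GREL is used for have close Python equivalents. A sketch of a typical cleanup step, trimming whitespace and normalizing capitalization, on some made-up messy values:

```python
def clean(value):
    """Roughly equivalent to GREL's value.trim().toTitlecase():
    strip surrounding whitespace, then title-case the words."""
    return value.strip().title()

# Hypothetical messy name fields from a spreadsheet.
names = ["  voltaire ", "d'ALEMBERT", " rousseau"]
print([clean(n) for n in names])
```

In OpenRefine the same operation would be applied to a whole column at once from the transform dialog, which is exactly what makes it approachable without programming.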
I hope that more humanities scholars learn to use basic tools, and perhaps some scripting. Even a small amount of programming skill can be very empowering, and it is surprising how easy it can be.
The third part of this post will highlight projects that use linked data (including those suggested by the HASTAC community).
Do you have a data-intensive project that might benefit from the use of linked data? Do you know of any smaller projects that make use of linked data? Any ideas for how to use linked data? Any linked data resources that everyone should know about?