This month marks the activation of CERN's Large Hadron Collider (LHC) in Geneva. As one of the major enabling technologies for managing the enormous volume of data the LHC generates, Grid Computing has returned as a topic of interest. In her introduction to the last HASTAC Scholar Forum, Ana Boa-Ventura discussed the role of humanists in High Performance Computing (HPC). Taking up this challenge to locate what Ana calls "the humane" in computational science has a very personal dimension for me: the last exciting project I worked on, before deciding to reinvent myself as a humanist, involved developing software and workflows for distributed and Grid Computing.
While the LHC represents the extreme high end of data-driven science (perhaps generating up to 27TB/day), many other scientific fields generate large amounts of data. Another characteristic of the experiments being run at the LHC is the size of the research teams, which often span national boundaries and require local copies of datasets. To summarize: it is now common to have sizable quantities of raw data, team members working on both shared and local datasets, and complex workflows requiring interaction across application platforms and staff alike.
In my previous field of cognitive neuroscience we had a somewhat analogous, if much smaller in scale, environment. The initial problem we selected for prototyping on the Grid was the creation of a probabilistic brain atlas template. This template would combine and average the high-resolution MRI images of a large number of subjects to define a common 3D space for the subject pool. Individual subject images would then be aligned to this common space, a process that reduces the amount of warping required to "fit" them to each other.
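At its simplest, the averaging step can be sketched as a voxel-wise mean over already co-registered volumes. A real atlas pipeline interleaves this averaging with nonlinear registration of each subject to the evolving template, so the sketch below is a deliberate simplification, and all the names and the toy data in it are illustrative:

```python
import numpy as np

def build_template(volumes):
    """Voxel-wise average of pre-aligned 3D volumes.

    Real atlas construction alternates this averaging with
    registration of each subject to the current template; here
    the inputs are assumed already aligned.
    """
    stack = np.stack(volumes)   # shape: (n_subjects, x, y, z)
    return stack.mean(axis=0)   # probabilistic template

# Toy arrays standing in for 40 high-resolution MRI scans.
rng = np.random.default_rng(0)
subjects = [rng.random((8, 8, 8)) for _ in range(40)]
template = build_template(subjects)
print(template.shape)  # (8, 8, 8)
```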
If we could successfully run the workflow to create this atlas from a small subject pool, say around 40 subjects, on the Grid, it could then be scaled up to run across the entire archive of a peer-reviewed fMRI repository. Such repositories contain collections of 4D fMRI (functional MRI) data contributed by investigators across the globe using a variety of MR scanners and subject pools. From simple workflows over 3D high-resolution data, larger-scale experiments could then be constructed across experimental groups. The really interesting questions could then be asked of specific data objects, for example: What applications produced this output file? What were the run-time arguments? Which subjects were used? How did these subjects identify their gender or race? Was this subject right- or left-handed? The data do not necessarily need to reside on your system's disk; in fact, they would most often be located at the site where the computation was performed. Such objects are termed "virtual data," and we point to them with records in databases or directories (perhaps with a URL). Of course, cataloging and reading metadata is nothing new; Apple has long stored metadata alongside your other files in the form of resource forks.
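The provenance questions above amount to queries over catalog records that point at remote data. A minimal sketch of such a record and query follows; the field names, application names, and URLs are illustrative, not the schema of any actual Grid virtual-data catalog:

```python
# A "virtual data" record: the bytes live at a remote site, and what
# we hold locally is a pointer (a URL) plus provenance metadata.
catalog = [
    {"url": "gsiftp://site-a.example.org/atlas/subj012_warped.nii",
     "producer": "align_warp",
     "args": ["-m", "12", "-q"],
     "subject": {"id": "subj012", "handedness": "right"}},
    {"url": "gsiftp://site-b.example.org/atlas/template_mean.nii",
     "producer": "softmean",
     "args": ["-v"],
     "subject": None},
]

def produced_by(app):
    """Answer 'what application produced this output?' in reverse:
    list every catalogued object a given application produced."""
    return [rec["url"] for rec in catalog if rec["producer"] == app]

print(produced_by("align_warp"))
```

The same dictionary could carry the subject-level attributes (handedness, self-identified gender or race) that make the demographic queries in the text possible.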
Much of my initial excitement about Grid Computing faded when I attempted to bring these technologies into the lab. Neuroscientists in particular love ready access to their data, and direct (file-system-based) access remains essential for the rapid reorganization and modification used in the traditional analysis pipeline. One new (debatably so) and interesting development that might address these needs is Cloud Computing. Clouds are currently being used to serve instances of virtual machines to users. Rather than exporting services, cloud providers (like Amazon's EC2) provision virtual machine (VM) images across large networks of storage and CPU resources. While clouds accomplish the goal of dynamically presenting powerful, networked computers directly to the desktop, the entire interoperability layer seems to have been removed.
Besides the obvious (virtual) material studies questions one might ask of both metadata and "virtual data," the collaborative components of scientific experiments performed on the Grid should be of great interest to digital humanists. Workflow authoring tools like Taverna are being used by groups such as the NIH's Cancer Biomedical Informatics Grid (caBIG) to compose, edit, and run collaboratively developed workflows. The computer systems these workflows run on are also potentially incredibly diverse, with different operating systems, CPUs, performance characteristics (interconnect, memory bandwidth, etc.), owners, and software stacks. I'm not sure how well material culture studies can be mapped onto the concept of "virtual data," and my suggestion here seeks, in part, to provoke some response to this question.
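Underneath any such workflow tool is a dependency graph: each step declares what it consumes, and the engine derives a valid execution order. This sketch shows only that ordering idea, using Python's standard-library `graphlib`; the step names are invented, and real Taverna workflows wire together web services and local tools far beyond this:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline steps mapped to the sets of steps they
# depend on (a simple linear chain here for clarity).
steps = {
    "align": set(),
    "warp": {"align"},
    "average": {"warp"},
    "slice": {"average"},
}

# A workflow engine would run each step as its dependencies finish;
# here we just compute one valid execution order.
order = list(TopologicalSorter(steps).static_order())
print(order)  # ['align', 'warp', 'average', 'slice']
```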
[The Flickr image is from a pre-print of a recent paper I contributed to on the work described in this post. The SwiftScript site maintains an archive of papers and references on "virtual data."]