Digital Documentation of Art workshop and fighting bit rot
I spent the wee hours of Saturday morning at the Maine International Conference on the Arts, an event that brought a broad range of artists, educators, and archivists to the UMaine campus. Along with my frequent collaborators Richard Corey and Sheridan Adams, I held a workshop on digital documentation of art.
It's a topic that could easily have filled much more than the two hours we spent on it, and it ranges from philosophical questions of who can make authoritative statements about art to the incredibly practical issue of how long I can trust my hard drive to keep working. We touched on both of those during the workshop as well as many other topics, and even got our group to document an improv performance by The Focus Group so they could try out different techniques with the various recording devices we had on hand.
I stuck to the more pragmatic side of things, pulling from part of my digital curation curriculum and focusing on how to best make sure that whatever documentation you do create for your art will still be viable when you go to look at it ten, twenty, or fifty years down the road. This is a big question in general, not just for the arts, and digital humanists need to take it seriously or they may find that much of their career's work is gone in an instant. The result was this cheat sheet that I handed out (click for full PDF):
Of course this is a lies-to-children version of digital archiving, and even then requires a bit of explanation before it makes sense. The top chart breaks file formats into three broad categories: work, access, and archive. Work files are what you use to actually produce whatever it is you're doing, often in proprietary software and formats. It's always good to keep them around (and, in my open-culture opinion, to distribute them), but they aren't necessarily going to be reliable in the future. Access files are what you most often distribute to others, and these examples are specifically geared toward the web. They typically will be lower quality than archival files but sufficient to get the point across. The goals of an archival file are zero information loss, longevity of file format, and the ability to be transcoded to produce additional access files as needed: damn the file size and full speed ahead.
I've found the strangest concept for most people is the idea of a container file format and the codecs that are used in them. The file formats in the top chart are not enough to make a complete decision about how to save your work, particularly for video and audio data. You also need to take into account the codecs that are used to compress that information within certain types of files. Those codecs are represented on the middle axes, with audio on top and video below, and their placement is somewhat subjective for several reasons. This chart takes into account things like the openness of the data format (proprietary is an archival dead-end), its popularity (obscure formats are less likely to have future users), and its render quality (there's little point in saving a file that doesn't represent your work). For the most part you want your archival materials more toward the lossless side of the spectrum and it's acceptable to have your access copies closer to the lossy side, though storage and network speeds are starting to obviate the need for that compromise. It's important to note that many of these codecs have additional settings that allow for different qualities – for instance, JPEG2000* has both a lossy and lossless mode, as do several of the other codecs listed.
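If the container-versus-codec split still feels slippery, it can help to model it as data. What follows is only a rough Python sketch, not anything taken from the cheat sheet: the `Codec` and `MediaFile` classes, the `is_archival_candidate` check, and the two example files are all names I made up for illustration. The one rule it encodes is the one above: the file extension alone tells you nothing until you know what is inside, and lossless is a property of how a stream was encoded, not just of the codec's name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    """One encoded stream inside a container (hypothetical model)."""
    name: str
    lossless: bool  # True only if this stream was encoded without information loss


@dataclass(frozen=True)
class MediaFile:
    """A container file plus the codecs used for the streams it holds."""
    filename: str
    container: str   # the wrapper format, e.g. "MXF" or "MP4"
    codecs: tuple    # Codec entries, one per audio or video stream

    def is_archival_candidate(self) -> bool:
        # A file with no known streams cannot be judged, so treat it as unsuitable.
        if not self.codecs:
            return False
        # Every stream must be lossless; one lossy stream disqualifies the file.
        return all(codec.lossless for codec in self.codecs)


# Illustrative examples only; the settings shown are assumptions.
master = MediaFile(
    "performance_master.mxf",
    "MXF",
    (Codec("JPEG2000 (lossless mode)", True), Codec("PCM audio", True)),
)
web_copy = MediaFile(
    "performance_web.mp4",
    "MP4",
    (Codec("H.264 (lossy settings)", False), Codec("AAC", False)),
)

for item in (master, web_copy):
    label = "archive" if item.is_archival_candidate() else "access only"
    print(f"{item.filename}: {item.container} container -> {label}")
```

A real archival decision also weighs openness and popularity, as described above, so treat the lossless test as one necessary condition rather than the whole answer.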
The bottom part of the chart is all about physical media and how long you can expect them to be reliable. This is perhaps the most fictional part of the graphic. The reality is that these reliability figures depend heavily on manufacturer-reported MTBF (mean time between failures), and your media may last for twenty minutes or may last for decades, assuming you can find a system able to read it decades from now. A more honest chart would probably say that the only rule is to keep as many copies of your important files in as many different places as possible. The old rule in IT is that a minimal backup solution requires three things: an active file that you have access to on a hard drive, a local backup that is probably archived on removable media, and an off-site backup that you can go back to when your data center burns down (IT folks are cheery optimists). The key is to make sure that you don't depend on just one backup, and that you migrate your old files forward whenever you get new systems and storage devices. If you do that, you don't need to worry about your old CD-R rotting, because you already copied the data to your shiny new SAN device and now have added redundancy.
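Migrating forward is only useful if the copies really match the original, so here is a minimal Python sketch of one way to check that, using only the standard library. It is an illustration under my own assumptions, not a full backup tool: the function names `sha256_of`, `migrate_file`, and `verify_copies` are hypothetical, and the destination folders stand in for whatever local and off-site storage you actually use. The idea is to record a checksum of the original, copy the file to every destination, and compare checksums both right away and on later check-ups.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destinations: list[Path]) -> str:
    """Copy source into each destination folder and verify every copy.

    Returns the checksum of the original so it can be stored for later checks.
    Raises IOError if any copy does not match the original.
    """
    expected = sha256_of(source)
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy_path = folder / source.name
        shutil.copy2(source, copy_path)  # copy2 also tries to keep timestamps
        if sha256_of(copy_path) != expected:
            raise IOError(f"Checksum mismatch for {copy_path}")
    return expected


def verify_copies(name: str, expected: str, destinations: list[Path]) -> list[Path]:
    """Return the copies that are missing or no longer match the checksum."""
    damaged = []
    for folder in destinations:
        copy_path = folder / name
        if not copy_path.exists() or sha256_of(copy_path) != expected:
            damaged.append(copy_path)
    return damaged
```

In use, you would call `migrate_file` once when new storage arrives, write the returned checksum down somewhere safe, and run `verify_copies` every so often. Any path it returns is a copy to replace from one of the others while you still have a good one.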
* JPEG2000 is primarily a still image format, but it is increasingly being used to store individual video frames at high quality. These frames are then wrapped in order inside a container such as an MXF file for playback. This differs from the technique used by most video codecs, which compress parts of different frames together, and when JPEG2000's lossless mode is used it results in completely lossless video storage.