Preserving Personal Projects

 

Last week, the members of our graduate seminar focusing on digital humanities were happy to hear from Dr. Amy Earhart of Texas A&M University. She presented us with an interesting discussion of how she has dealt with issues of digital preservation. She showed that while the internet opens up more space for the proliferation of the works of underappreciated and unknown authors, without serious consideration of how digital resources can be preserved, these texts can be re-relegated to the realm of 404 perdition.

As I embark on a personal project, I am wondering what is the best way to preserve my work. My plan is to use GIS technology to map references made to different locations in the 1991 novel Mala onda by the Chilean author Alberto Fuguet. The work provides a rich source of geo-locations, but rarely explains the socio-historic significance of these places. I want to create a map that would both visually represent the connections between text and geography and provide more information about the specific places.

(A very early iteration of what the project will look like, built with the online version of ArcGIS, can be seen here.)

I have two questions as I begin this project:

  1. What is the best way to save information from web sites in case they no longer exist in the future?
  2. What are the realities of a long-term preservation of this project?

Currently, I am using Zotero to capture web pages with information I find important and that I worry could be lost in the future (like this site listing personal and government accounts of what happened to victims during the 1973 military coup in Chile). Is there a better way to preserve this information?

Also, this project is driven by my own personal and professional interests. I am solely responsible for the creation and maintenance of the page. As we build projects like this, what is the feasibility that they’ll continue to exist long-term? Without being attached to any established, funded research group, is my work, or others like it, irredeemably doomed to be lost?

Bender

Preserving Personal Projects

Hello,

It so happened that I asked the same question on the same day in my post, where I highlighted two 'old' archives that are part of long-standing institutions, along with new personal initiatives that are at least complementary, or that bring new thematic collections into focus or apply new techniques, in this case digital photography.

Photo-Archives, Old and New

...These are serious initiatives that will hopefully last a long time, but they are of course not permanent. In the end, should all the information be lost in the cyber cemetery?

A possible solution? UNESCO has an Archives Portal.

Could UNESCO act as a depository, under well-defined conditions?

Regards,

K. Bender

rconsoli

Internet Archive

The purpose of the Internet Archive (http://en.wikipedia.org/wiki/Internet_Archive) seems to be exactly this. Their archiver bot, Heritrix/3.1.1, just came to my site, Squinchpix.com.

coblezc

Web Archiving

 

Choosing an archiving tool depends on why you want to preserve a site. If it's just to remember how it looked at a given time, then Zotero's screen capture is a simple, easy solution. If you're looking for some serious archiving, then Heritrix[1] is a solid bet. It's standards-based[2] and, as Robert mentions, used by the Internet Archive as well as the Library of Congress[3]. There is also a thread at DHAnswers[4] that might give you some more ideas.

 

1. https://webarchive.jira.com/wiki/display/Heritrix/Heritrix

2. http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml

3. http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

4. http://digitalhumanities.org/answers/topic/how-can-you-archive-a-set-of-established-blogs-together

coryduclos

Thanks!

These are all great responses! It seems a little overwhelming, but I guess it's important to start asking these questions up front to prevent total loss later.

rconsoli

Is Heritrix really what K. Bender meant?

I was happy to hear from Zach but I have questions about the suitability of any crawler (even though I appear to have implied it myself) for the purposes which K. Bender suggests. If I understand him correctly he wishes to have all the .html and all other functionality preserved as an archive so that the site continues to function normally even though it may be at a different URL. This must include all incidental code (.php, .asp, etc.), databases, and all other actual site material such as images, .pdfs, or whatever else is served by the website. Without that extra material no site would function and I believe that it is that which K. Bender wishes preserved. If I am mistaken I am sure he will say so but I cannot comprehend that 'archiving a web site' could have any other meaning. As I read more about Heritrix I came to realize that it's quite improbable that it could perform such a far-reaching function. I believe that K. Bender is actually suggesting some mechanism for turning over, when it is appropriate, ALL site materials to an organization that would continue to maintain and run it. Is Heritrix more than just a crawler? I hope that the rest of you will continue to add to this comment chain; it's possible that I've completely misunderstood what is at issue.

Best

Bob Consoli

SquinchPix.com

rconsoli

Corrections and Expansions

 

In my previous letter I said the following: "all the .html and all other functionality preserved as an archive so that the site continues to function normally even though it may be at a different URL." I see that this is exactly what the Library of Congress and other organizations actually ARE doing. I looked through the code of a sample page of .html from the LoC and satisfied myself that this is, in fact, the case. The sites most successfully archived are those which consist entirely of .html with associated images, or sites such as blogs which are self-contained. Even then all links in the .html text have been modified to point to LoC's own servers. Perhaps this modification is done in an automated fashion; I could not tell. Large and sophisticated sites (the ones you're probably most interested in preserving) usually involve some combination of .html and .php (or .asp) programs which generate .html dynamically (99% of squinchpix.com is created at run-time by .php). It is impossible to archive .html pages which do not even exist until run-time and, even then, only exist on the client's machine, unless Heritrix (and the like) execute the .php themselves. I sincerely hope that they cannot do this (although Google is sometimes convincingly accused of doing this); the very idea would fill a web master with horror. To successfully archive such sites would require that the archiving agency come into possession of the .php (or .asp) files and the ENTIRE DATABASES and the ENTIRE FILE STRUCTURE of imagery, .pdfs, etc., on top of which they run. This would require the complete cooperation of the web master and would represent a significant investment of time and energy. But now I really do await your further comments. Bob
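The automated link modification wondered about above could plausibly work along the following lines. This is only a minimal sketch in Python, not a description of what the Library of Congress actually does: the archive prefix, the function name, and the sample page are all invented for illustration, and a real tool would use a proper HTML parser rather than a regular expression.

```python
import re
from urllib.parse import urljoin

# Hypothetical archive prefix (an assumption); real archives use their own URL schemes.
ARCHIVE_PREFIX = "https://archive.example.org/web/2012/"

# Matches href="..." or src="..." attributes with double-quoted values.
LINK_ATTR = re.compile(r'(?P<attr>\b(?:href|src))="(?P<url>[^"]*)"', re.IGNORECASE)


def rewrite_links(html: str, page_url: str, prefix: str = ARCHIVE_PREFIX) -> str:
    """Return html with every href/src pointing at the archive's copy."""

    def replace(match: re.Match) -> str:
        # Resolve relative links against the page's original address first.
        absolute = urljoin(page_url, match.group("url"))
        return f'{match.group("attr")}="{prefix}{absolute}"'

    return LINK_ATTR.sub(replace, html)


# Invented sample page, for illustration only.
page = '<a href="/gallery.html">Gallery</a> <img src="img/column.jpg">'
print(rewrite_links(page, "http://www.example.com/index.html"))
```

Note that this only touches links in .html that already exists as text; it does nothing for pages that are generated at run-time, which is the harder problem raised above.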

Bender

What I meant and what I do not know

I am a simple user of a Google-website and Google-blog, both free of charge, and I have no competence at all regarding the software background for running or preserving such internet facilities. I see others using more sophisticated website- or blog-providers, such as WordPress. My point is that many such personal initiatives may be worth preserving when the present owner wishes to stop his/her activity. I think the decision to preserve should be in the first place the responsibility of the present owner.

From the discussion above, I understand that there might be several solutions, depending on the level of functionality still available once the site is no longer in the hands of the original owner.

In my case, I am waiting to understand more, possibly through examples, about the possible AND practical solutions, that is, solutions that work without the support of a software specialist at the side of the owner.

Of course, with ongoing software development any present solution may be outdated tomorrow.

Bender

UNESCO Conference The Memory of the World in the Digital age

This might be of interest.

UNESCO Conference:
The Memory of the World in the Digital age: Digitization and Preservation
26-28 September 2012, Vancouver, British Columbia, Canada
http://www.unesco.org/new/en/communication-and-information/events/calendar-of-events/events-websites/the-memory-of-the-world-in-the-digital-age-digitization-and-preservation/

rconsoli

'Crowd-sourcing' and the Tulip Mania

 

Dear K Bender,

First of all, an error by me. It's www.archive.org not www.archive.com (which is an Arabic-language site).

I looked for your web site on the Wayback Machine but it's not yet present. I share your general goal which is that work on the Internet (whatever form it should take) ought not to be lost just because the site owner tires of it or loses the domain name accidentally or dies without specifying what should happen to the site.
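For anyone who wants to repeat this kind of check without a browser, here is a minimal sketch in Python. It assumes the Internet Archive's public availability endpoint and the JSON layout it returns; the function names and the sample responses are invented for illustration, and only `check_site` touches the network.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Wayback Machine (assumed unchanged).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site: str) -> str:
    """Build the query URL that asks whether `site` has been archived."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': site})}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the address of the closest snapshot, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_site(site: str) -> Optional[str]:
    """Ask the Wayback Machine about `site` (performs a network request)."""
    with urlopen(availability_url(site), timeout=30) as reply:
        return closest_snapshot(json.load(reply))


# Invented sample responses, shaped like the service's replies.
not_archived = {"url": "example.org", "archived_snapshots": {}}
archived = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20120601000000/http://example.org/",
        }
    }
}
print(closest_snapshot(not_archived))
print(closest_snapshot(archived))
```

A result of None corresponds to the "not yet present" answer described above.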

I'm convinced that archiving web sites cannot be the whole solution. Much of the value in a web site is contained, as I know that I've said before, in directories to which bots have no access. For a bot-scanner to work as an archiver the target web site would have to be entirely in .html, which many web sites are not. Many web site pages don't even exist until run-time when they are created by .php or .asp code. Many such pages go out of existence immediately thereafter. Also there are many images, videos, and other media which exist in directories which are protected and not accessible to a mere bot. Frankly, I don't think we can ever get around the problem of losing web products for the reasons which I named above.

Not to let the best be the enemy of the good: for web sites which ARE in .html an archiving bot WILL work. So scanning for archiving (and then retrieval by something like the Wayback Machine) is a partial solution.

What can we do about other sites?

Let's look at this from two perspectives: How do we save valuable work? What can we do to elicit that work in the first place?

a. There's the Wikipedia approach (or wikis in general) which, in effect, aggregates the work of hundreds of thousands of contributors and prevents it from being lost. Now I can never be a friend to Wikipedia as it currently exists for two primary reasons. First, it cannot be quoted because there is no guarantee that the language will not change in the next few minutes. Second, articles are not signed so that you don't know whether you can trust the info. I will never accept the idea that a history of the Roman Empire can be 'crowd-sourced'. Such things must be written by authorities and we have to know who these authorities are. Nevertheless Wikipedia has the valuable function of ELICITING work and preserving it. Another partial solution.

b. There are pictorial sites such as Flickr and Google's Picasa and others which contain hundreds of thousands (many millions?) of valuable photographs. Again, such sites ELICIT valuable work and then they preserve it.

Now. Consider that we are successful; we have successfully collected and stored hundreds of thousands (or millions) of sites with all their supporting material. Now what? How do we access just the material we are looking for from this ocean of data? In my mind that's the heart of the problem. There is no way (that I know) to organize all these disparate web sites (with their inevitably conflicting information) to work as a single knowledge resource. And if that's true then what's the point of continuing with this specific approach?

It's easiest to illustrate this with material from a single site. It's all well and good for some site like Panoramio to claim that they have a huge data base of pictures running into the tens of millions. It's a completely different question when it comes to using this mass of images. There we have difficult and time-consuming problems which only grow worse as the collection gets larger ('better'). Can you find all the pictures taken in Italy which are on Flickr? Yes, trivially. Can you find all the pictures that show columns? Yes, trivially. Can you find all the pictures taken in Italy that show columns? Not so trivial. You can conduct such a search but it only succeeds if the photographer has tagged (keyworded) the photos to say 'Italy' and 'column'. You can find a representative sample but you can never be certain that you found them all or even found a significant percentage. Can you find all the column pictures in Italy from the first century? No. When I search for 'Italy','column','1C' on Flickr I get seven results, six of which are grossly wrong. Here's a good example of 'crowd-sourcing' not working; no one can tag their pictures consistently with all the others and we can't expect that they ever would. Unless Flickr (or Picasa or SmugMug or whoever) has their own authoritative staff consistently keyword the pictures on their site then the usefulness of such a resource, no matter how gigantic it becomes, is always going to be marginal. This is a small example based on one picture site. What about using hundreds of millions of archived sites in a coordinated way? I cannot conceive that this would ever be possible. Individual web sites are more different from each other than keyworded pictures are different from each other. Far more different.
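The weakness of tag-based search described above can be shown with a toy example. The sketch below is Python and makes no claim about how Flickr's search actually works: the `Photo` class, the `search` function, and the four photographs with their tags are all invented for illustration. The search requires every requested tag to be present, so any photo that was tagged differently, or not at all, silently drops out.

```python
from dataclasses import dataclass, field


@dataclass
class Photo:
    title: str
    tags: set = field(default_factory=set)


def search(photos, *wanted):
    """Return the photos that carry every one of the wanted tags."""
    wanted_tags = {tag.lower() for tag in wanted}
    return [p for p in photos if wanted_tags <= {tag.lower() for tag in p.tags}]


# Invented collection: three of these show columns in Italy.
photos = [
    Photo("Trajan's Column", {"Italy", "column", "Rome"}),
    Photo("Temple of Hera, Paestum", {"Italy", "temple"}),  # columns visible, never tagged
    Photo("Pantheon portico", {"Roma", "colonne"}),         # tagged in Italian only
    Photo("Parthenon", {"Greece", "column"}),
]

found = search(photos, "Italy", "column")
print([p.title for p in found])
```

Only one of the three relevant pictures is found, and nothing in the result tells the searcher that two others were missed.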

So. We need to elicit the work and preserve it. Ideally the work would be by named persons and it would not be modifiable except by those persons whose names appear on it. It would be edited for style and consistency. It would be organized and it would be large (i.e. the contributions of thousands of individuals). It would be extensive. It would be citable. It would last forever. What am I talking about?

We need an internet resource that takes the form of an encyclopedia. It would be the property of some organization that does NOT go out of business; some U.N. agency or other. It would solicit articles in various areas of scholarship in the arts and sciences. These articles would be signed. These articles would be edited. Candidates to write such articles would consist not only of recognized authorities but, perhaps, also of those web masters who are agreed to be doing valuable work. You, K. Bender, for example would be solicited for input on the history of representations of Aphrodite in Western art and your name would be used. Not only would such an encyclopedia feature articles but it would contain videos, photographs, etc., etc. Every resource would be consistently keyworded and searchable in several dimensions.

However impractical this may seem to you, my friend, this is the only way out.

Once we get rid of the collective and anti-scientific delusion that the 'crowd' can be a source of knowledge (the Tulip Mania of our time) the way forward will be obvious.

Robert H. Consoli

stepno

adding value to what's there

This is slightly off-topic, but perhaps the thread is worth spinning...

I've been working on a project that ties together existing Internet Archive and Google newspaper archive sites, along with YouTube video clips and other digital sources.

My research is related to a course I teach on "the portrayal of journalists in popular culture -- films, novels, radio and more..." 

The least researched of these are the radio dramatic series of the 1930s-1960s, so they are where I am putting my research energy. (For the course, I have lots of other sources on films, novels, etc.)

Radio collectors, who originally sold and traded transcription discs and tapes, moved into the digital era sharing MP3 files via Usenet, and then over the Web. Now thousands of MP3 files of series episodes are "circulating" on the Web -- usually with their provenance lost in time. (That is, it's hard to tell who made the MP3, from what transcription disc, and with what background research on actors, writers, copyright etc.) Individual collectors and the Old Time Radio Researchers group, organized around a Yahoo mailing list, have uploaded whole series and fragmented collections to the Internet Archive, adding commentary with varying degrees of accuracy.

My project involves listening to series and episodes so that I can analyze and discuss the "journalist" characters who appeared in these serial or one-shot dramatizations, and looking into whatever facts I can find about the programs and any real events and persons mentioned.

My technique is to post individual episodes as audio-player links in a WordPress blog -- usually using the MP3 files at Archive.org, not copying them to my own site -- and posting some discussion notes, which I eventually compile into more general or analytical "pages," which I like to think I'll further expand into a book. Meanwhile, Google's (discontinued) newspaper and magazine archiving project has left behind scanned copies of small and medium size newspapers, which often include listings or reviews of radio programs, or news stories related to some of the radio dramatizations -- such as the United Press series "Soldiers of the Press" or the sometimes-biographical series "Cavalcade of America." In those cases I take screen snapshots of the newspaper headlines and link them to the full copy at Google.

I also use IMDB.com and other film and radio sites as references for further information about programs, especially the radio adaptations of Hollywood films. In some cases, I embed YouTube movie players with trailers or full-length films.

I'd be interested in hearing from other researchers exploring similar techniques, as well as any interested in my subject matter: How the "new medium" of radio celebrated the "old media" of newspapers and magazines through the radio dramas and docudramas about print journalists.

Newspaper Heroes on the Air (jheroes.com) -- the home page, with blog/podcast items on individual radio episodes and a menu of "pages" on more general themes.

http://jheroes.com/about/this-site/  -- a general introduction

http://jheroes.com/at-the-movies/adaptations/ -- an overview page about radio adaptations of famous "newspaper" films, such as "His Girl Friday" and "Deadline USA" (and almost 40 more)

http://jheroes.com/real-life-reporters/walter-winchell/ -- a more detailed article about one journalist whose career drama wove in and out of truth and fiction.

http://jheroes.com/2012/06/11/journalist-carried-a-torch-for-lighthouse-keeper/ -- a radio docu-drama with a journalist as narrator (I'm still trying to track down whether the newspaperman in question was real, or created as a narrative device for the story of Ida Lewis)

http://jheroes.com/2012/06/09/dangerous-woman/ -- one of a series of radio dramatizations of stories about real-life United Press correspondents. I usually add Google newspaper-archive clips of their stories. For this one, I'm still researching more of the story-behind-the-story using her autobiographical books and other sources; the real mystery is about her first husband. He is not mentioned at all in the radio story, and fades into obscurity at the end of her book about the same Manila prison camp experience.