The loss of such a well-funded and popular site should give pause to any historian planning a digital project. Having read about planning, digitization, design, copyright, and building an audience of users and perhaps contributors, readers of this book now likely understand that digital history may require just as much work as projects involving paper and ink, and possibly much more.
Inevitably, any creative work that requires a significant amount of effort will elicit concern about how to ensure its ongoing existence. Although historians may not place much value on often short-lived materials such as exam study guides, when planning a substantial online history project such as an archive or exhibit it makes sense to think just as deeply about how to preserve such work for the long run as you do about the issues related to the creation of your digital materials in the first place. Similarly, if you have spent a great deal of time collecting historical documents over the web, you should be concerned about being able to reproduce those documents for others in the years to come as well as upholding your ethical obligation to contributors to save their donations. It would be a shame to “print” your website on the digital equivalent of the acidic paper used by many Victorian publishers, which is now rapidly deteriorating in libraries around the world.
In this chapter we discuss why such losses are common in the digital realm, and how you can try to avoid such a fate. Although we investigate traditional archiving principles and cutting-edge digital preservation topics and software, as before our aim is decidedly pragmatic. We focus on basic ways that you can prepare your website for a long, if not perpetual, existence online.
Fragility of Digital Materials
If only digital preservation were as easy as changing the quality of the paper we print on, as publishers and archivists have done by using high-grade acid-free paper for documents deemed sufficiently important for long-term preservation. Electronic resources are profoundly unstable, far more unstable than books. On the simplest level, many of us have experienced the loss of a floppy or hard drive’s worth of scholarship. The foremost American authority on the longevity of various media, the National Institute of Standards and Technology (NIST), still cannot give a precise timeline for the deterioration of many of the formats we currently rely on to store precious digital resources. A recent report by NIST researcher Fred R. Byers notes that estimates vary from 20 to 200 years for popular media such as the CD and DVD, and even the low end of these estimates may be possible only under ideal environmental conditions that few historians are likely to reproduce in their homes or offices. Anecdotal evidence shows that the imperfect way most people store digital media leads to much faster losses. For example, a significant fraction of collections from the 1980s of audio CDs, one of the first digital formats to become widely available to the public, may already be unplayable.
The Library of Congress, which holds roughly 150,000 audio CDs in conditions almost certainly far better than those of personal collections, estimates that between 1 and 10 percent of the discs in their collection already contain serious data errors. 1
Moreover, non-digital materials are often usable following modest deterioration, while digital sources such as CDs frequently become unusable at the first sign of corruption. Most historians are familiar (perhaps unconsciously) with this principle. We have gleaned information from letters and photographs discolored by exposure to decades of sunlight, from hieroglyphs worn away by centuries of wind-blown sand, and from papyri partially eaten by ancient insects. In contrast, a stray static charge or wayward magnetic field can wreak havoc on the media used to store “digital objects” (a catchall term that refers to everything from an individual image file or word document to a complex website) that we might wish to look at in the future. Occasionally the accidental corruption of a few bits out of the millions or billions of bits that comprise a digital file renders that file unreadable or unusable. With some exceptions, digital formats tend to require an exceedingly high degree of integrity in order to function properly. In an odd way, their perfection is also their imperfection: they are encoded in a precise fashion that allows for unlimited perfect copies (unlike, say, photocopied paper documents), but any loss of their perfection can mean disaster.
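The point about a few corrupted bits can be made concrete with a minimal sketch. It assumes a compressed file as a stand-in for a digital object and uses a SHA-256 checksum to detect damage; the sample data, the helper names, and the choice of zlib and SHA-256 are illustrative assumptions, not tools discussed in this chapter.

```python
import hashlib
import zlib

# Hypothetical stand-in for a digital object: a small compressed text file.
original = "\n".join(f"Record {i}: catalog entry" for i in range(500)).encode("utf-8")
stored = zlib.compress(original)


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_checksum: str) -> bool:
    """Report whether the data still matches its recorded checksum."""
    return sha256_hex(data) == expected_checksum


def try_read(data: bytes):
    """Return the decompressed bytes, or None if the file cannot be read."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


# Record a checksum while the file is known to be good.
recorded_checksum = sha256_hex(stored)

# Simulate a stray static charge: flip one bit in the middle of the file.
damaged = bytearray(stored)
damaged[len(damaged) // 2] ^= 0b00000001
damaged = bytes(damaged)

print("original intact: ", is_intact(stored, recorded_checksum))
print("damaged intact:  ", is_intact(damaged, recorded_checksum))
print("damaged readable:", try_read(damaged) == original)
```

A discolored letter remains legible, but here a single flipped bit is enough for the checksum comparison to fail and for the file to stop yielding its original contents. Recording checksums when files are created, and re-checking them periodically, is one simple way to notice such damage while an undamaged copy may still exist.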
Yet this already troubling characterization of digital materials only begins to scratch the surface of what we are up against in trying to save these bits. Historians, even those strongly committed to long-term preservation, can lose important digital resources in some very unsettling ways. The Ivar Aasen Centre of Language and Culture, a literary museum in Norway, lost the ability to use its large, expensive electronic catalog of holdings after the one administrator who knew the two sequential passwords into the system died. The catalog, an invaluable research tool stored in an encrypted database format, had taken four years to create, and contained information about 11,000 titles. After desperately trying to break into the system themselves, the Centre sent out an open call for help to computer experts and less above-board types, the reward being a round-trip flight to a Norwegian festival of literature and music. Within five hours a twenty-five-year-old hacker, Joakim Eriksson of Växjö, Sweden, figured out that the first password needed to access the system was the administrator’s last name spelled backwards. (The second password, equally suspect security-wise, was his first name spelled forwards.) 2
Beyond the frightening possibilities of data corruption and loss of access, all digital objects require a special set of eyes (often unique hardware and accompanying operating system and application software) to view or read them properly. The absence of these associated technologies can mean the effective loss of digital resources, even if those resources remain fully intact. In the 1980s, for instance, the British Broadcasting Corporation (BBC) had the wonderful idea of collecting pieces of life and culture from across the UK into a single collection to honor the 900th anniversary of William the Conqueror’s Domesday Book, which housed the records of eleventh-century life from over 13,000 towns in England following William’s invasion of the isle in 1066. Called the Domesday Project, the BBC endeavor eventually became the repository for the contributions of over a million Britons. Project planners made optimistic comparisons between the twentieth-century Domesday and its eleventh-century predecessor; in addition to dozens of statistical databases, there would be tens of thousands of digital photographs and interactive maps with the ability to zoom and pan. Access to this massive historical snapshot of the UK would take mere seconds compared to tedious leafing through the folios of the Domesday Book.
Such a gargantuan multimedia collection required a high-density, fully modern format to capture it all, so the BBC decided to encode the collection on two special videodiscs, to be accessed on specially configured Philips LaserVision players with a BBC Master Microcomputer or a Research Machines Nimbus. By the late 1990s, of course, the LaserVision, the BBC line of computers, and the Nimbus had all gone the way of the dodo, and this rich historical collection faced the prospect of being unusable except on a few barely functioning computers with the correct hardware and software translators. “The problems of software and hardware have now rendered the system obsolete,” Loyd Grossman, chairman of the Domesday Project, fretted in February 2002. “With few working examples left, the information on this incredible historical object will soon disappear forever.” One suspects that the Domesday Book’s modest scribes, who did their handiwork with quills on vellum that withstood nine centuries intact and perfectly readable, were enjoying a last laugh. Luckily some crafty programmers at the University of Michigan and the University of Leeds figured out how to reproduce the necessary computing environment on a standard PC in the following year, and so the Domesday videodiscs have gotten a reprieve, at least for a few more years or perhaps decades. But this solution came at considerable expense, a cost not likely to be borne for most digital resources that become inaccessible in the future. While the U.S. Census Bureau can surmount a “major engineering challenge” to ensure continued access to the 1960 census, recorded on long-outdated computer tapes, an individual historian, local history society, or even a major research university will be unlikely to foot similar bills for other historical sources. 3
We could fill many more pages (of acid-free paper) with examples of such digital foibles, often begun with good intentions that in hindsight now seem foolish. Digital preservation is a very serious matter, with many facets, yet unfortunately no foolproof solutions. As Laura McLemore, an archivist at Austin College, concludes pragmatically, “With technology in such rapid flux, I do not think enough information is available about the shelf life . . . or future retrieval capabilities of current digital storage formats to commit to any particular plan at this time.” The University of Michigan’s Margaret Hedstrom, a leading expert on digital archiving, bluntly wrote in a recent report on the state of the art (co-sponsored by the National Science Foundation and the Library of Congress), “No acceptable methods exist today to preserve complex digital objects that contain combinations of text, data, images, audio, and video and that require specific software applications for reuse.”
It is telling that in our digital age (according to the University of California at Berkeley, ink-on-paper content represented just 0.01% of the world’s information produced in 2003, with digital resources taking up over 90% of the non-printed majority) the New York Times felt compelled to use an analog solution for their millennium time capsule, created in 1998-1999. The Times bought a special kind of disk, HD-Rosetta, pioneered at Los Alamos National Laboratory to withstand nuclear war. The disk, holding materials deemed worthy for thousand-year preservation by the editors of the Times magazine, was created by using an ion beam to carve letters and figures into a highly pure form of nickel. Etched nickel is unlikely to deteriorate for thousands, or even hundreds of thousands, of years, but just to be sure, the Times sealed the disk in a specially made container filled with the highly stable, inert gas argon, and surrounded the container with thermal gel insulation. 5
Even skimping a bit on the argon and thermal gel, this is an expensive solution to most historians’ preservation needs. Indeed, we believe that any emphasis on technological solutions, particularly those hatched at places like Los Alamos and even (more seriously) the highly sophisticated, well-funded computer repository systems we explore at the end of this chapter, should come second to more basic tenets of digital preservation that are helpful now and are generally (though not totally) independent of the murky digital future. Archivists who have studied the problem of constant technological change realized some time ago that the ultimate solution to digital preservation will come less from specific hardware and software than from methods and procedures related to the continual stewardship of these resources. 6
That does not mean that all technologies, file formats, and media are created equal; it is possible to make recommendations about such things, and where possible we do so below. But sticking to fundamentally sound operating principles in the construction and storage of the digital materials to be preserved is more important than searching for the elusive digital equivalent of acid-free paper.
What to Preserve?
Think for a moment about the preservation of another precious commodity: wine. Connoisseurs of wine might have cheap bottles standing upright in the heat of their kitchen (a poor way to store wine if you wish to keep it around for a long time, but fine if you plan to drink it soon or don’t care if it gets knocked over by a dog’s tail) while holding their Chateau Lafite Rothschilds on their side at 55 degrees and 70 percent humidity in a dark cellar (ideal conditions for long-term storage of first-growth Bordeaux). Fine wine, of course, is worth far more attention and care than everyday wine. Expense of replacement, rarity, quality, and related elements factor into how much cost and effort we expend on storing such objects for the future. Librarians and archivists have always been attuned to such questions, storing some documents in the equivalent of the kitchen and others in the equivalent of a wine cellar, and historians interested in preserving digital materials should likely begin their analysis of their long-term preservation needs by asking similar questions about their web creations. What’s worth preserving?
The United States National Archives and Records Administration (NARA), entrusted to preserve federal government materials such as the papers of presidents and U.S. Army correspondence, has a helpful set of appraisal guidelines they use in deciding what to classify as “permanent”—i.e., documents and records that they will expend serious effort and money to preserve. (Although your archival mission will likely differ in nature and scope from NARA’s ambitious mission to capture “essential evidence” related to the “rights of American citizens,” the “actions of federal officials,” and the “national experience,” their basic principles still hold true regardless of an archive’s scale.) Many of these straightforward guidelines will sound familiar to historians. For example, you should try to determine the long-term value of a document or set of documents by asking such questions as,
Is the information [they hold] unique?
How significant is the source and context of the records? and
How significant are the records for research [current and projected in the future]?
Other questions place the materials being considered into a wider perspective. For example, “Do these records serve as a finding aid to other permanent records?” and “Are the records related to other permanent records?” In other words, by themselves some records have little value but they may provide insight into other collections, without which those other collections may suffer. It therefore may be worth preserving materials that taken by themselves have little perceived value. Finally, for documents not clearly worth saving but also not obvious candidates for the trash bin, NARA’s guidance is to ask questions related to the ease of preservation and access in the future: “How usable are the records?” (i.e., Are they deteriorating to such an extent as to make them unreadable in the near future?); “What are the cost considerations for long-term maintenance of the records?” (e.g., Are they on paper that may decay?); “What is the volume of records?” (i.e., the more there is, the more it will cost to store them). This list of appraisal questions comes out of a well-established archival tradition in which objects such as the parchment of the Declaration of Independence and United States Constitution stand at the top of a preservation hierarchy, receiving the greatest attention and resources (including expensive containers and argon gas), and less valuable records such as the casual letters of the lowest-ranking bureaucrat receive the least amount of attention and resources. In NARA’s physical world of preservation, this hierarchy is surely prudent and justified.
NARA’s sensible archiving questions take on a wholly different character in the digital, non-physical online world, however. Questions relating to deterioration (at least in the sense of light, water, fire, and insect damage) are irrelevant. The tenth copy of an email is as fresh and readable as the “original.” “Volume” is an even odder question to ask about digital materials. What is “a lot” or perhaps even “too much,” and when do we start worrying about that frightening amount? The White House generated roughly 40 million email messages in Bill Clinton’s eight years in office. At the end of Clinton’s term (the year 2000) the average email was 18.5 kilobytes. Assume for the sake of argument (and ease of calculation) that the policy wonks in Clinton’s staff were as verbose as they were prolific, writing a higher-than-average 25 kilobytes per email throughout the 1990s. That would equal roughly a thousand million kilobytes, or a million million bytes (that is, 1 terabyte, or the equivalent of a thousand Encyclopedia Britannicas) of text that needs to be stored to preserve one key piece of the electronic record of the forty-second president of the United States. That certainly sounds like a preposterous amount of storage, or, we should say, that sounded like a preposterous amount of storage, since by the time this book is in print there will almost certainly be computers for the home market shipping with that amount of space on the hard drive. At Clinton’s 1993 inauguration, one terabyte of storage cost roughly $5 million; today you can purchase the same amount of digital space for $500. 8
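The back-of-the-envelope estimate above can be checked in a few lines. This is only a sketch of the arithmetic using the figures quoted in the text; the variable names are ours, and it assumes decimal units (1 kilobyte = 1,000 bytes), which is what “a million million bytes” implies.

```python
# Figures quoted in the text.
EMAILS = 40_000_000            # messages over eight years
ASSUMED_KB_PER_EMAIL = 25      # deliberately generous assumption
AVERAGE_KB_PER_EMAIL = 18.5    # average email size in 2000

# Assumption: decimal units, so 1 KB = 1,000 bytes and 1 TB = 10**12 bytes.
BYTES_PER_KB = 1_000
BYTES_PER_TB = 1_000_000_000_000

total_kb = EMAILS * ASSUMED_KB_PER_EMAIL      # a thousand million kilobytes
total_bytes = total_kb * BYTES_PER_KB         # a million million bytes
total_tb = total_bytes / BYTES_PER_TB         # terabytes

# The same estimate using the average size instead of the generous one.
average_case_tb = EMAILS * AVERAGE_KB_PER_EMAIL * BYTES_PER_KB / BYTES_PER_TB

# Cost of one terabyte in 1993 versus "today" (the time of writing).
COST_PER_TB_1993 = 5_000_000
COST_PER_TB_TODAY = 500
cost_ratio = COST_PER_TB_1993 // COST_PER_TB_TODAY

print(f"{total_kb:,} KB = {total_bytes:,} bytes = {total_tb} TB")
print(f"at the average email size: about {average_case_tb:.2f} TB")
print(f"storage became {cost_ratio:,} times cheaper")
```

Even with the generous 25-kilobyte assumption the total comes to one terabyte, and the quoted prices imply a ten-thousand-fold drop in the cost of storing it.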
In the pre-digital age, it would have been impossible to think that a researcher could store copies of every letter to or from, say, Truman’s White House, in a shoebox under her desk, but that is precisely where we are headed. The low cost of storage (getting radically less expensive every year, unlike paper) means that it very well may be possible or even desirable to save everything ever written in our digital age. The selection criteria which form the core of almost all traditional archiving theories may fall away in the face of being able to save it all. This possibility is deeply troubling to many archivists and librarians since it destroys one of the pillars of archiving: that some things are worth saving due to a perceived importance, while other things can be lost to time with few repercussions. It also raises the specter that we may not be able to locate documents of value in a sea of undifferentiated digital files.
Surely this selection or appraisal process remains important to any discussion of preservation, including digital preservation, but the possibility of saving it all or even saving most of it presents opportunities for historians and archivists that should not be neglected. Archives can be far more democratic and inclusive in this new age. They may also satisfy many audiences at once, unlike traditional archives, by providing a less hierarchical way of approaching stored materials. Blaise Pascal famously said that “the heart has reasons that reason cannot understand”; we have found that in the world of digital preservation, researchers have reasons for using archives that their creators cannot understand.
Or predict. Our September 11 Digital Archive, containing over 140,000 digital objects related to the terrorist attacks and their aftermath, and contributed by tens of thousands of individuals, receives thousands of visitors every day who want to know how people felt on that tragic day and in the days and months following. In 2003, most of these visitors came to the Archive via a search engine, having typed in (unsurprisingly) “September 11” or “9/11.” But because of the breadth of our online archive, 228 visitors found our site useful for exploring “teen slang,” 421 were searching for information on the “USS Comfort” (one of the Navy’s hospital ships), and 157 were simply looking for a “map of lower Manhattan.” In other words, thousands of visitors came to our site for reasons that had absolutely nothing to do with September 11, 2001. Historians should take note of this very real possibility when considering what they may wish to preserve. Brewster Kahle, the founder of the Internet Archive, likes to say that his archive may hold the early writings of a future president or some other figure about whom historians will cherish information in decades to come. Assessing which websites to save in order to capture that information in the present, however, is incredibly difficult, and perhaps impossible.
The NARA questions about the relationship between materials under consideration for archiving and those already preserved also take on different meanings in the digital era. One of the great strengths of the web is its interconnectedness: almost every site links to others. Such linkages make it difficult to answer the question of whether a set of digital documents under consideration for preservation is relevant to other preserved materials. Because of the interconnectedness of the web, the best archive of a specific historical site is probably one that stores it embedded in a far larger archive of the web itself. 11