Information loss

The Internet is now home to modern culture. But like paper, it is at risk of decay unless we archive and store it properly.

A technician in the server hall of the Russian state telecommunications operator. Credit: Andrey Rudakov / Bloomberg / Getty Images.

Our cultural discourse increasingly happens on the web. Not so long ago, conventional media (e.g., music, movies, books, scholarly publications) were primary and the web was a supplementary channel. This has now changed: the web is often the primary channel, and other publishing mechanisms, if present at all, supplement the web.

Unfortunately, the technology for publishing information on the web has always outstripped our technology for preserving it. With the primacy of the web, our scientific, legal, and cultural record is at risk of decay, tampering, and uncertain provenance.

One of the most convincing examples of the centrality of the web for influencing, conveying, and recording our culture is a popular recurring segment on a late-night TV talk show (‘Jimmy Kimmel Live!’), in which celebrities humorously read ‘mean tweets’ (written by average web users) about themselves. Of course, these segments are popular on YouTube. Although a conventional TV show remains part of the process, the source material comes from the web (Twitter), and after broadcast the show is available on the web (YouTube). But whose responsibility is it to preserve those videos? YouTube’s? ABC Studios’, who own the show? Unlike hard-copy media, web resources are not amenable to benign neglect – they need active management to be preserved.

Unfortunately, a significant problem facing web archiving is that it remains at the fringes of the larger web community. My best anecdote pertains to a web archiving paper submitted by my team to the 2010 World Wide Web (www) conference. One of the reviews stated: ‘Is there [sic] any statistics to show that many or a good number of Web users would like to get obsolete data or resources?’ This is just one reviewer, but the terminology used (‘obsolete data or resources’) succinctly captures the problem: web archiving is not widely seen as a priority, or even as in scope, for a conference such as www.

Another common misconception is that the Internet Archive has every copy of everything ever published on the web, so preservation is a solved problem. Despite the heroic efforts of the Internet Archive, the reality is more grim: only sixteen percent of the resources indexed by search engines are archived at least once in a public web archive.

While there are many specific challenges with regard to quality criteria, tools, and metrics, the common thread goes back to the fact that we, the web archiving community, have failed to articulate clear, compelling use cases and to demonstrate immediate value for web preservation. For too long, web preservation has been dominated by threats of future penalties, such as hoary stories about file obsolescence (‘some day we won’t be able to read gifs!’) that have not come true. The lack of a compelling, immediate use case for archives has relegated preservation to an insurance-selling idiom (‘some day you’re going to lose all of this!’), where uptake is unenthusiastic at best.

Recent studies have shown that our scholarly and legal record is decaying. A 2014 study measured ‘reference rot’ in scholarly articles and found that twenty percent of articles (from a sample published between 1997 and 2012) with links to general web resources (i.e., not just to other scholarly publications) experienced either the familiar ‘404 Not Found’ error or the more insidious and difficult-to-detect problem of content drift, whereby the web server returns a web page, but it is no longer about the same topic as when the link was created.
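
To make the two failure modes concrete, the sketch below (in Python, using the third-party requests library and the Internet Archive’s public availability API) classifies a link as rotten, drifted, or intact. The 0.5 similarity threshold is an illustrative stand-in, not a value from the study, and a real drift detector would compare extracted text rather than raw html.

```python
# Sketch: classify one link as rotten, drifted, or intact.
# Assumes the third-party 'requests' library; the threshold is illustrative.
import difflib
import requests

WAYBACK_API = 'https://archive.org/wayback/available'

def check_reference(url):
    """Return a rough diagnosis of reference rot for one URL."""
    live = requests.get(url, timeout=30)
    if live.status_code == 404:
        return 'link rot: 404 Not Found'
    # Ask the Internet Archive for the closest public snapshot.
    info = requests.get(WAYBACK_API, params={'url': url}, timeout=30).json()
    closest = info.get('archived_snapshots', {}).get('closest')
    if not closest:
        return 'no archived copy to compare against'
    archived = requests.get(closest['url'], timeout=30)
    # Crude drift signal: low textual similarity between live and archived copies.
    similarity = difflib.SequenceMatcher(None, live.text, archived.text).ratio()
    if similarity < 0.5:
        return f'possible content drift (similarity {similarity:.2f})'
    return f'page still resembles the archived copy (similarity {similarity:.2f})'

print(check_reference('https://example.com/'))
```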

Another recent study, which attracted a lot of attention in the popular press, found that approximately seventy percent of links in the Harvard Law Review, and fifty percent of the links in US Supreme Court opinions, also experience reference rot. These findings provided the impetus for the Perma.cc archiving project at Harvard University, an on-demand archiving service run by and for academic libraries.

But it is not just the scholarly and legal record that decays. In previous work we measured the half-life of popular music videos on YouTube to be as little as nine months, even though there can be hundreds or even thousands of simultaneous copies of a video on YouTube at any given time. So while you are likely to find a copy of a particular popular song on YouTube, any specific URL that you have linked or bookmarked will likely decay. Given enough metadata you can re-find the song, but if the anchor text is just ‘I really like this song’, then the chances of re-finding it are greatly reduced.

Social media is also subject to the same rate of disappearance as regular websites. It is often considered ephemeral and not worth preserving (e.g., cat pictures and lunch updates), but this is not always so. In many ways social media has become the new ‘first rough draft of history’. In 2012 we measured the decay of tweets from the 2011 Egyptian Revolution as well as five other significant historical events. We found that within the first year, eleven percent of the resources linked from the tweets (e.g., pictures, news stories) had disappeared, with only twenty percent of the resources appearing in a public web archive. Each following year saw approximately seven percent of the resources disappear. One tweet that now returns a 404, along with the image embedded in it, originally read: ‘An armed man runs on a rooftop during clashes between police and protesters in Suez, January 28, 2011’. Long after the original tweet and image had been deleted, the Topsy service (topsy.com) had the tweet and image archived, but by February 2015 they had been removed from there as well. These tweets are not ephemera – they are certainly the real first rough draft of history.

A more recent example was the downing of Malaysia Airlines Flight 17 (mh17) in the summer of 2014. An archived version of a Russian-language social media site appears to show rebels in Ukraine claiming to have shot down a plane, with an accompanying video – a claim that was later removed from the live website. A copy of the page in the Internet Archive shows an archived version of the original claim plus the video, but clicking on the video reveals that the video itself is not actually archived.

The mh17 example highlights that a page ‘being archived’ is not a binary condition. Modern web pages often need dozens to hundreds of embedded images, style sheets, and scripts to render properly. Unfortunately, as many as twenty-four percent of html pages do not have all of their embedded resources archived. In our research group we are working to shift the focus from ‘hooray, the page is archived!’ to asking quantitatively: ‘how well is this page archived?’
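
As a rough illustration of what such a quantitative question involves, the following sketch fetches an archived page, enumerates the embedded resources declared in its html, and reports the fraction that still resolve within the archive. It is a simplification: it assumes the requests library, ignores resources referenced from style sheets and scripts, and assumes the archive answers HEAD requests.

```python
# Sketch: what fraction of a memento's embedded resources are archived?
# Assumes 'requests'; ignores resources referenced from CSS and JavaScript.
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class ResourceExtractor(HTMLParser):
    """Collect URLs of embedded images, scripts, and stylesheets."""
    def __init__(self):
        super().__init__()
        self.resources = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('img', 'script') and attrs.get('src'):
            self.resources.append(attrs['src'])
        elif tag == 'link' and attrs.get('href'):
            self.resources.append(attrs['href'])

def completeness(memento_url):
    """Fraction of the page's declared embedded resources that resolve."""
    page = requests.get(memento_url, timeout=30)
    parser = ResourceExtractor()
    parser.feed(page.text)
    if not parser.resources:
        return 1.0
    ok = 0
    for ref in parser.resources:
        r = requests.head(urljoin(memento_url, ref),
                          timeout=30, allow_redirects=True)
        if r.status_code == 200:
            ok += 1
    return ok / len(parser.resources)
```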

In one project we measured the impact of missing embedded resources. The idea is that not all missing resources impact the quality of the page equally: we cannot simply report ‘only one out of fifty embedded resources is missing’, because if the single missing resource is an embedded video (e.g., from YouTube) then the user’s perception will be that the page is not well archived. On the other hand, some archived pages are missing many resources that have little to no impact on the user’s experience – for them the page is not damaged at all. Of course, if a resource is missing it is difficult to assess how important it originally was. Using structural hints from archived web pages, we created a method for automatically assessing the damage to a page that strongly correlates with human assessments of that damage. In assessing the holdings of the Internet Archive from 1996 to the present day, we found that although the rate of missed embedded resources is slowly increasing, the rate of missing important resources is decreasing.
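
The toy scoring function below conveys the spirit of such a measure. The type weights are invented stand-ins for illustration; the actual method derives a resource’s importance from structural cues in the page, not from a fixed table.

```python
# Toy weighted damage score, in the spirit described above.
# The weights are illustrative stand-ins, not values from the study.
TYPE_WEIGHTS = {'video': 10.0, 'stylesheet': 5.0, 'image': 3.0, 'script': 2.0}

def damage_score(resources):
    """resources: list of (resource_type, is_missing) pairs.
    Returns weighted damage in [0, 1]; 0 = undamaged, 1 = everything lost."""
    total = sum(TYPE_WEIGHTS.get(t, 1.0) for t, _ in resources)
    lost = sum(TYPE_WEIGHTS.get(t, 1.0) for t, missing in resources if missing)
    return lost / total if total else 0.0

# One missing video among fifty resources outweighs the naive 1/50 = 0.02:
page = [('video', True)] + [('image', False)] * 49
print(f'damage: {damage_score(page):.2f}')
```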

In another project we focus not on the embedded resources that are missing, but rather on the embedded resources that are present and whether or not they are correct. It is possible for archives to replay web pages that contain temporal violations – combinations of the html page and images that never actually existed together on the live web. As many as five percent of the pages from the Internet Archive can be shown to contain temporal violations, and only eighteen percent of the pages from the Internet Archive are both temporally consistent and complete (i.e., missing zero embedded resources). In the same study we also found that including archives in addition to the Internet Archive decreased the number of missing resources, but at the expense of possibly introducing more temporal violations.
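
A crude way to screen for candidate violations is to compare the Memento-Datetime header (defined by the Memento protocol, RFC 7089) of an archived page against those of its embedded resources, as sketched below. A large spread only suggests a problem; proving a genuine violation requires further evidence, such as Last-Modified headers. The thirty-day window is an arbitrary illustration.

```python
# Sketch: flag embedded mementos captured far from the page's capture time.
# Assumes 'requests' and that the archive serves Memento-Datetime (RFC 7089).
from datetime import timedelta
from email.utils import parsedate_to_datetime
import requests

def memento_datetime(url):
    """Capture datetime of an archived resource, from its Memento-Datetime header."""
    r = requests.head(url, timeout=30, allow_redirects=True)
    return parsedate_to_datetime(r.headers['Memento-Datetime'])

def suspicious(page_url, resource_urls, window=timedelta(days=30)):
    """Yield embedded resources whose capture time is far from the page's.
    A heuristic screen only, not proof of a temporal violation."""
    page_time = memento_datetime(page_url)
    for res in resource_urls:
        if abs(memento_datetime(res) - page_time) > window:
            yield res
```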

The Internet Archive dominates most discussions of web archiving, and rightly so, since it is both the first and the largest public web archive. However, there is an increasing number of public web archives, many of which are part of the International Internet Preservation Consortium. The Memento Project, a joint effort between Old Dominion University and Los Alamos National Laboratory, defines the mechanics of inter-archive access, making it easier for clients to leverage all public web archives when viewing the past web. As mentioned above, one of the concerns regarding web archiving is that once people are convinced of the problem, they will assume that the existence of the Internet Archive means the problem is already solved. However good a job the Internet Archive does, a single copy of the cultural record is not a good defence against decay. As such, the Internet Archive shares copies of its holdings with archives in diverse geographical and legal locations.
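
In practice, Memento works through datetime negotiation: a client asks a TimeGate for the archived copy of a URL closest to a desired moment. The sketch below uses the Time Travel aggregator’s TimeGate, which fronts many public archives; the particular endpoint and the requests library are assumptions of the sketch, not part of the protocol itself.

```python
# Sketch: Memento (RFC 7089) datetime negotiation against an aggregator
# TimeGate that spans many public web archives. Assumes 'requests'.
import requests

TIMEGATE = 'http://timetravel.mementoweb.org/timegate/'

def closest_memento(url, when):
    """Ask the TimeGate for the archived copy nearest the requested datetime."""
    r = requests.get(
        TIMEGATE + url,
        headers={'Accept-Datetime': when},  # e.g. 'Thu, 01 Jan 2015 00:00:00 GMT'
        allow_redirects=False,
        timeout=30,
    )
    # The TimeGate redirects to the best-matching memento in some archive.
    return r.headers.get('Location', 'no memento found')

print(closest_memento('http://www.cnn.com/', 'Thu, 01 Jan 2015 00:00:00 GMT'))
```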

It is important to remember that web archives are themselves vulnerable to technological, legal, and economic threats. A number of public web archives have already either failed (e.g., mummify.it) or experienced significant service interruptions (e.g., www.peeep.us). In short: archives are not magic websites; they are regular websites, subject to the same threats. Solutions that rewrite links to point to web archives without retaining the original URL are an especially insidious form of decay, introduced (with good intentions) by the page’s author. For example, links that were rewritten to point to ‘https://mummify.it/XbmcMfE3’ will no longer work, and now there is no way to know what the original URL was.

Despite the primacy of the web as the home of our cultural discourse, web archiving is often considered unimportant or viewed as a solved problem not worthy of further study. Both attitudes put our cultural record – social, legal, and scholarly – at risk of decay. To date, most discussions about web archiving have stopped at rejoicing that a web page is available in a web archive; future research will have to focus on quantitatively evaluating how well a page has been archived. Preservation of the cultural record is a task that requires collaboration and coordination, and the future will shift from simply arresting its decay to assessing its veracity.

This essay originally appeared under the title ‘Information Loss’ in ‘Decadence & Decay: Perspectives from the Engelsberg Seminar’, Bokförlaget Stolpe, in collaboration with Axel and Margaret Ax:son Johnson Foundation, 2014.

Author

Michael L. Nelson