Wednesday, August 11, 2010

The Internet's memory is faulty

I was recently doing some research for a personal project and wanted to understand how the content and the look and feel of some prominent websites had changed over time. I had previously used the Internet Archive's Wayback Machine, which periodically stores snapshots of various websites, so I tried it on the New York Times. The machine did bring up a number of stored pages from the past. On closer inspection, however, I realized that the stored pages were not a completely accurate representation of the New York Times' past.

When I clicked on the Apr 22, 2008 entry, I was sent to the current (Aug 11, 2010) page. Other entries, such as the one for Oct 27, 2009, took me to the correct historical page, but some of the links on the stored page were broken. For example, clicking on the Afghanistan story at the top left led to an error page, while similar story links for other dates worked just fine. The inconsistency weakened my confidence in the Wayback Machine.

Some web searching turned up alternatives to the Wayback Machine in a blog post from Jan 2008. But the alternatives had their own problems: they were either restricted to a narrow topic (e.g. health), covered only a narrow time window (the most recent three months), or had stopped functioning altogether (e.g. Blogging Ecosystem). In short, none of them could beat the Wayback Machine.

Perhaps the web needs to adopt a time dimension, as proposed by the Memento project. The idea is that the infrastructure underlying all the content pages would be made capable of storing history, so that queries such as mine could be answered automatically by each website without needing a central archive. Until then, does anybody know of a mechanism that tops the Wayback Machine for historical versions of today's top websites?
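
To make the Memento idea a little more concrete, here is a minimal sketch in Python of how a client might ask for a page "as of" a given date. It assumes the Accept-Datetime request header, which is the name later standardized for Memento in RFC 7089; the function name memento_headers is my own invention for illustration, and whether any particular site or archive honors the header is a separate question.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when: datetime) -> dict:
    """Build request headers asking for a page as it existed at `when`.

    Hypothetical helper: only the Accept-Datetime header name comes
    from the Memento specification (RFC 7089).
    """
    # Treat naive datetimes as UTC (an assumption made for this sketch).
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # The header value is an HTTP-style date expressed in GMT.
    when = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}


# The date of the snapshot that sent me to the current page instead.
headers = memento_headers(datetime(2008, 4, 22))
print(headers)
```

A server that supports this kind of date negotiation would be expected to answer with, or redirect to, the stored version closest to the requested date, which is exactly the query I was trying to make by hand.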