The Vanishing WWW
A recent report by the Pew Research Center is making headlines.
The study is based on data from Common Crawl, a non-profit that regularly crawls (downloads) the entire web and provides the collected data as open data. The report makes it clear that a huge amount of web content is being lost.
For example, 38 percent of the web pages that existed at the time of the 2013 crawl have already disappeared, either because the page itself was deleted or because the entire website is gone. In addition, a quarter of the web pages that existed at some point between 2013 and 2023 were no longer accessible as of October 2023. Nor is it only old pages that disappear: 8 percent of the pages that existed in the 2023 crawl are already unavailable.
The same applies to social media: about a fifth of X/Twitter posts (tweets) disappear within a few months of being posted. In 60 percent of these cases, the account itself has been set to private or deleted. Interestingly, more than 40 percent of tweets written in Turkish and Arabic disappear within three months. I suspect this reflects how rampant impression fraud is in these languages.
Furthermore, 23 percent of web pages on news sites contain at least one broken link, regardless of whether the site is a high-traffic (popular) one or not. Twenty-one percent of web pages on government sites also contain at least one broken link, with local government pages being particularly prone to this. On Wikipedia, 54 percent of articles contain at least one link in their 'References' section that points to a page that no longer exists. Wikipedia is strict about "Citation needed", but in many cases the sources themselves no longer exist.
Into the dark
The World Wide Web was once seen as a problem because data, once leaked, could not be erased. As the Streisand effect shows, the more you try to erase it, the more it spreads. In the future, however, it is the disappearance of data that we should be concerned about. Recently, I discussed the depletion of training data for generative AI. The World Wide Web, one of the main sources of that training data, is itself thinning out. It’s not just text data that’s at risk. Videos, too, are concentrated on YouTube, and they are in considerable danger. If YouTube were to disappear tomorrow, a not-so-small percentage of the video recordings of the last 20 years or so would vanish completely.
The disappearance of web pages that existed in the past is a problem in itself. However, I am more concerned about the apparent decline in open online communication, not only on the web but in public forums of every kind.
For example, it used to be possible to follow the development process of open source software in great detail after the fact. Development discussions were basically conducted on mailing lists and IRC, and almost all records, including informal chats, were kept and available to anyone. It is true that even today, social software development sites such as GitHub keep some logs in the form of issues and tickets, but the nuances are often not captured.
It is clear that today a great deal of online communication, not just software development, is moving to semi-private chat and messaging services such as Slack, Discord, or even Telegram. Many of these services lose their logs over time, partly because they charge for long-term storage. In the past, records were kept on paper and could survive for thousands of years. In today's information-oriented world, everything might disappear within a few years and may never be 'excavated' in the future. This will make e-archaeology significantly more difficult for those who come after us. It is our responsibility to future generations to preserve records as much as possible.
The relative decline in the value of information
Some people may wonder, however, why there was such an insistence on open information sharing in the past. Back then, information was scarce and there was a widely shared 'hunger' for it. It was considered a virtue to make information as open as possible and reusable by anyone. As Tim Berners-Lee's pivotal Contract for the Web states, the World Wide Web was created "to make knowledge freely available". Today we are drowning in information, at least as far as humans are concerned. This may explain why we are becoming relatively less enthusiastic about sharing it. If so, it is ironic that this has happened precisely because the World Wide Web has been so successful.