Recent deals between generative AI and media companies
OpenAI, known for ChatGPT among other products, has announced a partnership with Reddit, the world's largest message board site. The announcement states that "OpenAI will access Reddit’s Data API, which provides real-time, structured, and unique content from Reddit," which seems to mean that Reddit posts will become a major training target.
The key to improving the performance of generative AI is training data, but data suitable for training is becoming increasingly scarce. For humans, Wikipedia and the World Wide Web are vast oceans of knowledge that could never be fully explored, but AI has already consumed nearly all of them. OpenAI has been sued by the New York Times and others for unauthorized use of their content, and to minimize such risks it needs formal partnerships with media companies that can serve as data sources. In fact, OpenAI recently signed a similar agreement with the Financial Times, and while Reddit has no paywall, the deal is surely part of OpenAI's scramble to secure as much training data as possible without the risk of litigation.
Content contamination by generative AI
Although I believe that data from Reddit and similar sites has been included in generative AI training sets for some time, I have concerns about its formal inclusion in future training. Reddit is already considered "polluted" with AI-generated content. For example, some Reddit users post many (somewhat nonsensical) long comments every few minutes, at a rate no human could sustain; these are most likely bots that use generative AI to post automatically. Such AI bots exist in large numbers, and not only on Reddit, and regulating them (even with AI-based detection, for that matter) is a long shot. Historically, spammy "trolls" have been relatively easy to detect: they post in fixed patterns regardless of the surrounding conversation, or their sentences simply do not read as naturally as a human's. AI-powered bots show neither tell, and simple rate-based checks like the sketch below only catch the crudest automation.
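As a concrete illustration, here is a minimal sketch in Python of the kind of naive posting-rate heuristic described above. Everything in it (the data shape, the thresholds, the function name) is hypothetical and of my own choosing; it flags only mechanical posting rhythm and is blind to AI-written text posted at a human pace.

```python
from datetime import datetime
from statistics import fmean, pstdev

# Hypothetical input: (author, UTC timestamp) pairs, one per post.
Post = tuple[str, datetime]

def flag_inhuman_posters(posts: list[Post],
                         min_posts: int = 20,
                         max_mean_gap_s: float = 180.0,
                         max_gap_stdev_s: float = 30.0) -> set[str]:
    """Flag authors whose posting rhythm looks mechanical: many posts,
    very short average gaps, and suspiciously uniform spacing."""
    by_author: dict[str, list[datetime]] = {}
    for author, ts in posts:
        by_author.setdefault(author, []).append(ts)

    flagged: set[str] = set()
    for author, times in by_author.items():
        if len(times) < min_posts:
            continue  # not enough activity to judge
        times.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        # "Every few minutes, like clockwork" = short mean gap, low variance.
        if fmean(gaps) <= max_mean_gap_s and pstdev(gaps) <= max_gap_stdev_s:
            flagged.add(author)
    return flagged
```

The weakness is obvious: a bot that randomizes its posting intervals, or simply posts slowly, sails straight through, which is exactly the point made above.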
Most importantly, the Web itself is awash in AI-generated content. According to NewsGuard, a news-verification organization, there are at least 840 "news sites" that appear to be generated automatically by AI, and that number is expected to grow, not shrink. In some cases disinformation or fake news is spread deliberately as part of information warfare, but in many others falsehoods spread simply because the AI is dumb: it learned a lie and is now repeating it. And if people learn that their content is a target of AI training, they will be all the more tempted to plant self-serving narratives by mass-producing them. This is reminiscent of Jorge Luis Borges' short story "Tlön, Uqbar, Orbis Tertius," in which a secret society attempts to create another world through a fabricated encyclopedia.
The auto-intoxication of generative AI is already happening
In short, generative AI is turning into something that learns from AI output, produces more output, learns from that in turn... and so on. This is the sense in which AI is becoming intoxicated with itself. The situation is aggravated by people who lack the knowledge to check such output and take it at face value, so that "humans" end up writing and endorsing it in the same way. In any case, it is hard to believe that this trend will improve the accuracy of AI, least of all with respect to its biggest current problem, hallucination (the phenomenon of an AI presenting lies and fabrications as if they were fact).
Of course, techniques such as few-shot learning and transfer learning have been studied extensively to improve accuracy from limited training data, but I personally do not think they will be decisive without a significant breakthrough. There is also the idea of having AI synthesize the training data itself, on the reasoning that data is data, so there is no harm in learning from it (many studies and startups generate training data with simulations and algorithms). But just as a photocopy of a photocopy gradually blurs, there is a limit to how far this can go; the toy simulation below illustrates the effect. I think the old principle of GIGO (garbage in, garbage out) still applies today.
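To make the photocopier analogy concrete, here is a toy simulation of my own in Python; it assumes nothing about any real training pipeline, and the distribution, sample size, and generation count are arbitrary choices. Each "generation" fits a Gaussian to the previous generation's output and then samples its successor's training data from that fit, the degenerate loop described above (known in the research literature as model collapse).

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # reproducible run

def fit(samples: list[float]) -> tuple[float, float]:
    """'Train' a model: estimate mean and stdev from the data."""
    return fmean(samples), pstdev(samples)

def generate(mu: float, sigma: float, n: int) -> list[float]:
    """'AI output': draw n samples from the fitted model."""
    return [random.gauss(mu, sigma) for _ in range(n)]

# Generation 0: real data from a standard normal (mean 0, stdev 1).
N = 50
data = generate(0.0, 1.0, N)

for gen in range(20):
    mu, sigma = fit(data)          # learn from the current corpus
    data = generate(mu, sigma, N)  # the next generation sees only this output
    print(f"gen {gen:2d}: mean={mu:+.3f}  stdev={sigma:.3f}")

# With each pass, sampling noise compounds: the estimates drift away from
# the true (0, 1), and the stdev tends to shrink over time, like detail
# lost in successive photocopies. No new information ever enters the loop.
```

Real training pipelines are vastly more complex, but the mechanism is the same: once a model's outputs dominate its inputs, errors and biases are recycled rather than corrected.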