The real open source AI?
There is a lot of discussion going on about “open source AI” these days. While self-proclaimed “open source AI” of unclear meaning is rampant, there is also a movement to define open source AI properly. The most prominent example is probably the Open Source AI Definition (OSAID), led by the Open Source Initiative. I attended the discussion in Paris the other day and came away very inspired.
At that time, a question came to mind: is there a concept in the field of AI implementation that is equivalent to copyleft?
Copyleft is a concept asserted by some open source licenses, such as the GNU GPL. If you modify and distribute source code whose license claims copyleft, you are required to publish the modified portions under the same terms as the original. Since it is impossible to make any part of copylefted code “closed” (i.e. proprietary), this is a form of licensing that gives more power to the licensee than to the licensor and strongly encourages the sharing and publication of collaborative work. Since claiming copyleft is not a prerequisite for being open source (or free/libre software), it is also a “strong” form of the open source concept; the “weak” or “broad” sense of open source AI, as with the Open Source Definition, is what the OSAID currently under discussion largely covers.
In the case of AI, the question is what should be open-sourced. For traditional open source, the main target was (source) code, which was protected by copyright and, where claimed, covered by copyleft. In generative AI, however, the relationship between code and data is much closer than in traditional software, so open-sourcing the code alone is not enough. In fact, in many generative AI systems the training code is already released under an open source license, but the trained results, i.e. the model weights, often are not. Moreover, the data used for training may not be generally available in the first place. So even if you had all the code used by OpenAI or Meta available as open source, and equivalent computing resources, you probably still could not recreate GPT-4 or LLaMA exactly.
The real software freedom
Going back to the FSF’s original definition of “software freedom”, free software should allow you to do the following:
The freedom to run the program as you wish, for any purpose (freedom 0).
The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
The freedom to redistribute copies so you can help others (freedom 2).
The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.
These four freedoms must be guaranteed in practice. In the past, they were ensured as long as the source code was freely available (apart from exceptional cases such as software patents). This is not the case with current generative AI. So-called open-weight models sometimes permit commercial use and customization to a certain extent, but this is a licensing format similar to the freeware of the past, and cannot be called truly FLOSS (Free/Libre and Open Source Software).
Ultimately, what copyleft was trying to guarantee was the equivalence of source code and object code. Its essence is that, while the object code we actually run may be opaque to humans, we are guaranteed access to source code that is human-readable and corresponds one-to-one with that object code.
Reproducibility as a new copyleft
In this sense, the concept of Reproducible Builds is perhaps a more appropriate analogue of copyleft for generative AI. This is a software build method that attempts to guarantee that the same source code and the same toolchain will always produce exactly the same object code, bit for bit.
Even something as trivial as differing timestamps on files generated during the build will change the resulting hash, so achieving Reproducible Builds in the strictest sense can sometimes be difficult. On the other hand, verification is very easy: it is enough to show that the same artifact can be produced from the source code at hand, as sketched below. And even if individual users find doing so too cumbersome, it is enough for a third-party platform like Hugging Face to verify it once.
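To make the idea concrete, here is a minimal sketch of what such a verification could look like: rebuild from the same source and compare cryptographic hashes of the artifacts. The build command (./build.sh) and the artifact paths are hypothetical placeholders, not any particular project’s actual pipeline.

```python
# Minimal reproducibility check: rebuild from the same source tree and
# compare SHA-256 hashes of the published and freshly built artifacts.
# "./build.sh" and the artifact paths below are hypothetical placeholders.
import hashlib
import subprocess


def sha256(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def rebuild_and_verify(published: str, rebuilt: str = "out/model.bin") -> bool:
    # Re-run the (assumed deterministic) build from the same source.
    subprocess.run(["./build.sh"], check=True)
    # Bit-for-bit reproducibility means the hashes must match exactly.
    return sha256(published) == sha256(rebuilt)


if __name__ == "__main__":
    ok = rebuild_and_verify("model.bin")
    print("reproducible" if ok else "NOT reproducible")
```

The point is that anyone holding the source (and, for AI, the training data and recipe) can repeat this check; the trust then rests on the hash comparison rather than on the publisher’s word.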
Of course, this lacks the coercive power of the original copyright-based copyleft, and for today’s largest models it is impractical in terms of computational resources and cost. However, although huge LLMs are very popular at the moment, I personally think that Small Language Models (SLMs), which are domain-specific and limited in scope and are used as practical tools rather than as an all-purpose funny chat buddy, will become more widespread in the future. In that case, making an AI system reproducible and verifying it will not require that many resources.
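As a rough illustration of what “making an AI system reproducible” could involve at the training level, here is a minimal sketch that pins the obvious sources of randomness so that two runs of the same tiny training script yield identical weights. It assumes PyTorch and a toy model of my own invention; a real pipeline would also have to pin library versions, data ordering, and hardware and kernel choices.

```python
# Minimal sketch: remove the obvious sources of nondeterminism from a
# (hypothetical) small-model training run, so that the same code, data,
# and seed yield the same weights. Assumes PyTorch; real reproducibility
# also requires pinning library versions, data ordering, and hardware.
import random

import numpy as np
import torch

SEED = 42


def make_deterministic(seed: int = SEED) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Fail loudly if a nondeterministic operation would be used.
    torch.use_deterministic_algorithms(True)


def train_tiny_model() -> torch.Tensor:
    make_deterministic()
    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    data = torch.randn(256, 16)
    target = data.sum(dim=1, keepdim=True)
    for _ in range(100):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        loss.backward()
        opt.step()
    # The trained weights are the "object code"; returning them flattened
    # makes it easy to hash or compare them across independent runs.
    return torch.cat([p.detach().flatten() for p in model.parameters()])


if __name__ == "__main__":
    w1 = train_tiny_model()
    w2 = train_tiny_model()
    print("reproducible run:", torch.equal(w1, w2))
```

For an SLM-sized system, re-running such a pinned-down training job is exactly the kind of verification a third party could afford to do once.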
Reproducible Builds will likely be demanded where truly high-level security is needed, such as certain kinds of government procurement or military use, since it is not impossible to make an LLM learn to react in unintended ways to certain triggers, as a kind of supply chain attack. In such cases, I think reproducibility could be a selling point in itself, and while the EU AI Act already seems to address open-source AI, it might also become possible to require that such AI be reproducible in the sense I have been discussing here.