The AI War: Open-Source vs. Closed-Source Models
Investigating the factors that will determine the winner in the battle for AI dominance
"If you don't have a competitive advantage, don't compete." -Jack Welch
If you haven’t been paying attention, there’s currently a war going on for control of AI technologies. Elon Musk and other ML researchers recently called for a pause on AI experiments in an open letter. A researcher recently left Google after claiming the company was stealing data from OpenAI’s models. Many, many different large models have been released recently and are being fine-tuned with additional data. A lot of companies are vying to become the top AI provider, but at the moment the moats are thin. However, I was struck by this tweet from Naval:
[Tweet embed from @naval]
Let’s dig into this and see if we can figure out who will win. We’ll dive into the differences between open and closed source models, look at an adjacent industry that might have historical similarities, and then make some predictions about where this all goes.
Open-Source vs Closed-Source Models
Closed-source models are those whose internal details are not disclosed and whose use comes with constraints on how and when they can be applied. Open-source models are openly available for anyone to use. The world is built on open-source software, from Linux to Postgres to Firefox to PyTorch, and the list goes on. In this vein, Cerebras just announced the release of Cerebras-GPT, supposedly the first compute-optimal, fully open-source large language model.
[Tweet embed from @CerebrasSystems announcing Cerebras-GPT]
Currently, open-source models and closed-source models have similar levels of performance because they use similar datasets; that is to say, both are trained on the entirety of the open web. The open question is whether this openly available data is enough for open-source models to achieve competitive performance without access to the proprietary datasets that closed-source models will have. We also do not know whether there is a set amount of data that would yield human-level performance, beyond which additional data provides no marginal improvement. Who wins ultimately comes down to a single question:
Are models that only use openly available data able to remain competitive with those that use proprietary data?
As with all important things, the answer is, “it depends”. It depends on the use case. For specialized use cases, there is little doubt in my mind that closed-source models will prevail. These are areas that require highly specialized skills and abilities, such as medical diagnosis, pharmaceutical development, advanced scientific research, and fraud detection. The higher the cost to acquire training data, the less likely it is to be included in an open-source model.
Conversely, open-source models are likely to be sufficient and competitive in areas where deep expertise is not required. Think customer service support, everyday conversation, text and speech tailoring, text summarization, stock photo creation, and email generation. These areas could be augmented with RLHF (reinforcement learning from human feedback) approaches. Additional data is also much easier to incorporate back into open-source models than proprietary data is, because the value of any individual unit of data is much lower. In less specialized areas, open-source models will likely prevail because it will be easier to maintain a sufficient level of performance without falling behind.
However, if a model is falling behind, whether open or closed, there are ways to use the efforts of your competitors to your advantage. Let’s talk about some methods that allow you to “steal” a competitor’s model just by interacting with it.
Cat and Mouse Games
One interesting twist in the open-source vs closed-source battle was Stanford’s Alpaca model, which showed that one can train a large AI model by using the inputs and outputs of another model as a dataset for fine-tuning. In Alpaca’s case, they did this using 52,000 known inputs and outputs from ChatGPT. That was enough data to fine-tune Meta’s LLaMA model to performance similar to, or better than, ChatGPT at a cost of only ~$600. GPT4All then took it a step further and trained on ~800k samples.
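To make that recipe concrete, here is a minimal sketch of this kind of imitation fine-tuning: a base model is trained with a standard causal language modeling objective on prompt/response pairs collected from a stronger model’s API. The checkpoint name, the `teacher_outputs.jsonl` file, the prompt template, and the hyperparameters below are illustrative assumptions, not the actual Alpaca recipe.

```python
# Sketch of imitation fine-tuning on prompt/response pairs harvested from
# another model's API. Checkpoint, file name, and hyperparameters are assumed.
import json
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

class ImitationDataset(Dataset):
    """(instruction, output) pairs collected from a stronger model's API."""
    def __init__(self, path, tokenizer, max_len=512):
        with open(path) as f:
            self.examples = [json.loads(line) for line in f]
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        text = (f"### Instruction:\n{ex['instruction']}\n\n"
                f"### Response:\n{ex['output']}{self.tokenizer.eos_token}")
        enc = self.tokenizer(text, truncation=True, max_length=self.max_len,
                             padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        labels = input_ids.clone()
        # Don't compute loss on padding tokens.
        labels[enc["attention_mask"].squeeze(0) == 0] = -100
        return {"input_ids": input_ids,
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}

base = "huggyllama/llama-7b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imitation-ft",
                           per_device_train_batch_size=4,
                           num_train_epochs=3,
                           learning_rate=2e-5),
    train_dataset=ImitationDataset("teacher_outputs.jsonl", tokenizer),
)
trainer.train()
```

The point is less the specific numbers than the shape of the process: nothing in it requires access to the teacher model’s weights, only its outputs.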
How does this work? It comes from an idea I heard from Geoffrey Hinton about how to shrink the size of a neural net. Essentially, if you have the inputs to a model and the corresponding outputs, you can train a new model that approximates the old one reasonably well. In the current environment, this means that anyone releasing a generative AI API to the public is providing data that can be used to steal the underlying model. Since these providers want people using their APIs, they have to find other ways to protect their IP. This may mean poisoning the outputs to prevent others from training on them. Unfortunately, poisoning is much easier to perform on images than on text. Ultimately, just like cybersecurity, the building and stealing of models will become a cat-and-mouse game.
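The idea Hinton described is essentially knowledge distillation. Below is a toy sketch under assumed shapes, temperature, and training setup: a small “student” network is trained to match the softened output distribution of a larger “teacher”, using only inputs and the teacher’s outputs, which is exactly the information a public API exposes.

```python
# Toy knowledge-distillation sketch: the student only ever sees the teacher's
# inputs and outputs, never its weights. All shapes and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature softens the teacher's distribution

for step in range(1000):
    x = torch.randn(32, 128)          # stand-in for queries sent to the teacher
    with torch.no_grad():
        teacher_logits = teacher(x)   # the "API responses" we get back
    student_logits = student(x)
    # KL divergence between softened distributions is the distillation loss.
    loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```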
Adjacent Industry
Let’s look at another industry that has similarities to the current state of AI: weather modeling. If you didn’t know, the two main providers of weather data and modeling are NOAA and ECMWF. NOAA even has a page listing the users of its data. Pretty much everyone else, especially the names you are familiar with like The Weather Company and AccuWeather, is reselling or repackaging data from these organizations. These third-party weather providers attempt to build better models but rarely actually perform better; instead, they typically create a better user experience around the data products provided. In fact, Michael Lewis covered the dynamics of this industry in “The Coming Storm”.
The reason so much weather data is resold is that rigorous weather data collection, while extremely important, is also extremely expensive. In my mind, this parallels the current state of generative AI. Many AI companies are thin wrappers around the APIs of a few key large-model providers (OpenAI, Cohere, Meta, StabilityAI, etc.) who have done the hard work of collecting vast amounts of data and paying for the expensive compute to train large models. With so few providers of core capabilities, most companies using generative AI need to find other ways to differentiate themselves.
To gain an advantage in weather modeling, you either have to be a much better modeler (ECMWF routinely outperforms NOAA and other US-based models) or have better data. The two main services I have seen gain a substantial data advantage were Weather Underground and Dark Sky (you will be missed!), both of which were eventually acquired. They built their data advantage by crowdsourcing sensor data from individuals to get a more granular dataset for localizing forecasts. In a similar fashion, those who hold substantial proprietary data are going to come out ahead in the AI game. Acquiring proprietary data naturally favors closed-source models, since the organizations behind them can build products or pay for data far more readily than an open-source project can.
Observations and Predictions
Here’s how I think this plays out:
Open-source models will win the pervasiveness game, dominating as a starting point for startups, initial concepts, and non-critical projects. They will be good enough for general use but not good enough for highly specialized or highly competitive areas.
Closed-source models will be able to achieve higher performance due to their creators’ ability to acquire data that is not openly available. These models will also offer tailored features for their customers that the open-source community will not be able to match in a timely manner.
These relative differences will matter. Any top organization will either heavily modify open-source models with its own data or use a closed-source model vendor.
The ability to “steal” models by interacting with them will create a Red Queen effect that advances the industry: staying in the same place will mean falling behind.