I don't look at a man who's expert in one area as a specialist. I look at him as a rookie in ten other areas.
-Conor McGregor
This is a scientist
This is an athlete
This is a doctor
This is a business leader
Now, this is a hammer
This is a hammer for heavy objects
This is a hammer for delicate surfaces
This is a hammer for rocks
Each of these hammers and people belongs to a general group, but each is highly specialized for what it does. That specialization, or specificity, allows them to excel in a particular field. However, the cost of specialization is not doing well in other areas. You would be foolish to use a sledgehammer to pound a nail into a wall to hang a picture. As I learned at a young age, "use the right tool for the job." You would also be hard-pressed to see a Nobel Prize-winning scientist do well in the NBA. People can be above average in multiple arenas, but I haven't seen many be at the top of multiple fields.
It's well established that specialization and specificity win in Mother Nature for a given environment. However, that doesn't mean they win over time or in all scenarios. Fish don't work well out of water, polar bears struggle in a swamp, and crocodiles, apex predators that have been around for millions of years, are hosed if you put them in the desert. Being tailor-made for your environment helps you succeed, but it comes at the cost of being worse in other areas. What Mother Nature does to prevent over-specialization is change the environment over time; traits or attributes that are too specialized then die out.
In machine learning, the general assertion has been that models specialized for a task perform better than more general models (assuming you aren't overfitting). That's why you hear people continually talk about fine-tuning models: they are trying to specialize the model for the task or context at hand. Ultimately, it's about how well you can match the underlying distributions between training and inference. For instance, if you're trying to predict insurance risk on homes, separate models for desert and coastal areas would likely each perform better than a single model for the entire country.
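To make that concrete, here's a minimal sketch of the desert-vs-coastal idea using simulated data and made-up features: when the regions follow genuinely different distributions, a per-region model tends to beat one model trained on everything.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

def make_region(n, coef, noise=0.3):
    """Simulate homes in one region: features map to risk with region-specific weights."""
    X = rng.normal(size=(n, 3))                      # e.g., home age, value, elevation (made up)
    y = X @ coef + rng.normal(scale=noise, size=n)
    return X, y

# The two regions follow genuinely different distributions.
desert_coef = np.array([2.0, -0.5, 0.1])
coast_coef = np.array([-1.0, 1.5, 3.0])
Xd, yd = make_region(500, desert_coef)
Xc, yc = make_region(500, coast_coef)
Xc_test, yc_test = make_region(200, coast_coef)      # held-out coastal homes

# One general model trained on everything vs. one model specialized for the coast.
general = Ridge().fit(np.vstack([Xd, Xc]), np.concatenate([yd, yc]))
coastal = Ridge().fit(Xc, yc)

print("general model, coastal error:    ", mean_absolute_error(yc_test, general.predict(Xc_test)))
print("specialized model, coastal error:", mean_absolute_error(yc_test, coastal.predict(Xc_test)))
```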
This general assertion gets turned around a bit with something called ensemble learning. An ensemble takes a collection of models that have been specialized by looking at the data in different ways and combines their outputs into an aggregated result. You can think of this as having different advisors for different situations: you'll call a doctor if you're sick and a business leader if you have a product idea. If you have insurance models for the desert, the mountains, the jungle, the plains, the coastal areas, the city, and so on, you run your data through each one and get a prediction. You then have an aggregation function that takes those predictions and transforms them into a final answer. In this way you cover for each model's errors while also putting more weight on the right context. Accordingly, for a house in Florida, the aggregation function might weight the coastal and swamp models a bit more.
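As a toy illustration of that aggregation step (the model names, predictions, and weights below are all made up), you can think of the combiner as nothing more than a context-dependent weighted average:

```python
import numpy as np

def aggregate(predictions, weights):
    """Weighted average of the individual model predictions."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1
    return float(np.dot(w, predictions))

# Predictions from each specialized model for a single house (illustrative numbers).
preds = {"desert": 0.12, "mountain": 0.18, "coastal": 0.55, "swamp": 0.47, "city": 0.25}

# For a house in Florida, upweight the coastal and swamp experts.
florida_weights = {"desert": 0.05, "mountain": 0.05, "coastal": 0.45, "swamp": 0.35, "city": 0.10}

risk = aggregate(list(preds.values()), [florida_weights[k] for k in preds])
print(f"aggregated risk estimate: {risk:.3f}")
```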
To craft a quality ensemble, you need more than just a collection of different models. You need a collection of good and different models. If they are different but not good, they won't boost each other enough. If they are good but similar, they just reinforce each other and don't cover each other's errors. They need to be both good and different.
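A quick sanity check on "good and different" is to look at each member's individual accuracy alongside how often pairs of members disagree; the labels and predictions below are toy values just to show the idea.

```python
import numpy as np
from itertools import combinations

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
members = {
    "model_a": np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
    "model_b": np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1]),
    "model_c": np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 0]),
}

for name, preds in members.items():
    print(name, "accuracy:", (preds == y_true).mean())        # is it good?

for (a, pa), (b, pb) in combinations(members.items(), 2):
    print(a, "vs", b, "disagreement:", (pa != pb).mean())     # is it different?
```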
Mixtral
Anyway, a few months ago Mistral AI released their Mixtral 8x7B model, a large language model that is a sparse mixture-of-experts model. What's special about this model is that it achieves very good performance with a substantially smaller number of parameters used at inference. This version contains 8 "expert" models within a single model, each trained on a different subset of data and contexts. When data flows through the model, each piece routes to only the 2 of the 8 experts most relevant to the context of that data. This leads to a more accurate and faster assessment with fewer active parameters. Mistral created a general model by training specialized models within it.
What's different about these experts from a normal ensemble algorithm is that each uses the same underlying architecture, and not every expert operates on every piece of data. Instead of an actual board of experts where both the underlying brain architecture and life experience are different for each member, it's the same brain architecture with different experiences, and you only talk to two of them at a time. It's kind of like how an actor is the same underlying individual, just playing a different role.
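Here's a minimal sketch of that top-2 routing idea, using tiny made-up dimensions and plain numpy rather than Mixtral's actual implementation: a router scores each expert for a token, only the two highest-scoring experts run, and their outputs are mixed by the renormalized gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 8, 2

# Each "expert" is just a small linear layer for illustration.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))       # gating weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token):
    """Route one token vector to its top-2 experts and mix their outputs."""
    scores = token @ router                          # one score per expert
    top = np.argsort(scores)[-top_k:]                # pick the 2 most relevant experts
    gate = softmax(scores[top])                      # renormalize over the chosen 2
    # Only the selected experts run, so most parameters stay unused for this token.
    return sum(g * (token @ experts[i]) for g, i in zip(gate, top))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)   # (8,)
```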
Generalized vs Specialized Models
You can build specialized models that beat general models. You can build general models composed of specialized models that beat single, specialized models. You can fine-tune (read: specialize) general models composed of specialized models to beat ordinary general models composed of specialized models. Where does it all end, and what wins out? Well, as with most great things, it depends. It depends on what the task is. Is it a single, specific task or a highly general one? Do you need to identify people in an image, or do you need to identify all objects in an image? Do you need to create text for any query that might be thrown at you, or do you only need to generate product names in the food space?
It also depends on the difference in performance between the two models. GPT-4 is going to outperform GPT-2 no matter what specialization you've built into the latter, unless you're looking at a highly overfit scenario. It's a bit like the fact that we're only about 1% different in DNA from chimps, yet no chimp can compete at human levels, as Neil deGrasse Tyson has explained.
Two questions exist:
What will prevail, many specialized AIs or a generalized AI?
How much better does one model need to be before specialization no longer matters?
The way I see it, the answer is dynamic and cyclic, with the answer to the second question giving you the answer to the first. As long as performance is close, specialization will proliferate. If a generalized model is substantially better, which probably only takes an edge of a few percentage points, it will easily outcompete the specialized models. But then only generalized models are left standing, which means the way to win in a world of similarly performing models is once again through specialization and fine-tuning. Therefore, I expect to see boom-and-bust cycles of classes of models over the next few years in the AI world, as our research and development in this area mimics an evolutionary process.
Thanks for a great primer on ensemble models in general and Mixtral 8x7B in particular. I had a vague idea of how they functioned, but this gave me a deeper appreciation of it.
From what I'm observing, the trend is currently pivoting to these types of mixed models, but the pendulum may well swing back with e.g. GPT-5 or other paradigms within generalist models.