At any moment there is only a fine layer between the 'trivial' and the impossible. Mathematical discoveries are made in this layer.
― A.N. Kolmogorov
Embracing Enigmas tries to stay away from being a pop-science newsletter about AI. It is meant to endure by showcasing the lessons, mental models, philosophies, and tenets of what it takes to apply AI successfully, particularly in business. There are plenty of great newsletters that report on the latest happenings and their implications, and I’m always happy to provide recommendations and discuss my favorites. While Embracing Enigmas doesn't focus on the latest news in AI and machine learning, recent examples may be mentioned to better illustrate certain topics. For this piece, however, we will wander into recent news, because this breakthrough is too good to pass up and it illustrates an enduring lesson in the field: algorithms come and go at a blazing pace, and the biggest gains come not from optimizing how something is done but from doing things in ways that have not been done before.
KANs
In case you haven't heard, this paper was recently released by researchers at MIT, Caltech, and Northeastern and is taking the machine learning community by storm. The researchers created a new algorithm called Kolmogorov-Arnold Networks (KANs), which take a very different approach from the typical neural network, the multilayer perceptron (MLP), also known as a feedforward network (FFN). KANs are purported to be much more accurate than MLPs with significantly fewer nodes. KANs work by learning activation functions on the edges of the network instead of weights. You can see the difference in the diagram from the paper in Figure 1.
Figure 1. The illustrated difference between MLPs and KANs
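To make that contrast concrete in code, here is a minimal sketch of the two layer types. This is not the paper's implementation: the class names are my own, and I swap the paper's B-spline parameterization for a simpler Gaussian radial-basis expansion, so that each edge still carries its own small learnable univariate function.

```python
# Minimal sketch (not the paper's code): MLP layer vs. a simplified KAN-style layer.
import torch
import torch.nn as nn

class MLPLayer(nn.Module):
    """Standard MLP layer: learnable weights on the edges, a fixed activation at the nodes."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)   # learned: weight matrix W and bias b
        self.act = nn.ReLU()                   # fixed: the nonlinearity never changes

    def forward(self, x):
        return self.act(self.linear(x))

class SimplifiedKANLayer(nn.Module):
    """KAN-style layer: each edge (input i -> output j) carries its own learnable
    univariate function phi_{j,i}, and each output node simply sums its edges."""
    def __init__(self, d_in, d_out, n_basis=8, x_range=(-2.0, 2.0)):
        super().__init__()
        # Fixed grid of basis-function centers; only the mixing coefficients are learned.
        self.register_buffer("centers", torch.linspace(x_range[0], x_range[1], n_basis))
        self.width = (x_range[1] - x_range[0]) / n_basis
        # One coefficient vector per edge: shape (d_out, d_in, n_basis).
        self.coeffs = nn.Parameter(0.1 * torch.randn(d_out, d_in, n_basis))

    def forward(self, x):                                    # x: (..., d_in)
        z = (x.unsqueeze(-1) - self.centers) / self.width    # (..., d_in, n_basis)
        basis = torch.exp(-z ** 2)                           # Gaussian bumps per input value
        # phi_{j,i}(x_i) = sum_k coeffs[j,i,k] * basis_k(x_i); node j sums over i.
        return torch.einsum("...ik,oik->...o", basis, self.coeffs)

# Same input/output shape, very different learnable pieces.
x = torch.randn(32, 4)
print(MLPLayer(4, 3)(x).shape, SimplifiedKANLayer(4, 3)(x).shape)  # both: (32, 3)
```

The point of the sketch is the location of the learning: the MLP learns one number per edge and keeps its nonlinearity fixed, while the KAN-style layer learns an entire small curve per edge and keeps only summation at the nodes.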
Why is this important? First, neural networks are function approximators, and the team behind KANs created an architecture that approximates functions better. Second, this new network represents a shift in architecture and a shift in thinking about how a neural network can be successful. While we are currently seeing big gains from making larger and larger networks trained on more and more data, bigger advances in performance will come from improvements in model architecture. KANs show a path forward for what a new architecture could be.
What's the difference between the two methods? Both models build one large equation to go from input to output. MLPs are based on the Universal Approximation Theorem, which states that enough weighted copies of a simple fixed function, combined together, can approximate any other function. While there are theoretical assurances of universal function approximation, there are many practical limitations, such as no guarantee that a given architecture will converge or recover a given function. KANs, by contrast, rely on the Kolmogorov–Arnold representation theorem, which states that any multivariate continuous function can be written as a combination of univariate (single-variable) continuous functions and addition. For those who studied engineering, this is akin to the principle of superposition, where you add simpler components together to form a complex one. The difference implies that MLPs need scale and depth, since you want as many weighted values as possible to approximate a function, while KANs need breadth and variety in order to maximize the number of function combinations they can express.
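For reference, the two results can be written side by side. The Kolmogorov–Arnold representation (top) is an exact identity built from univariate functions, while the universal approximation form (bottom) approximates a function with weighted copies of a single fixed nonlinearity:

```latex
% Kolmogorov–Arnold representation: exact, built from 1-D functions and addition
f(x_1, \ldots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)

% Universal approximation (one hidden layer): approximate, built from weighted
% copies of a single fixed nonlinearity \sigma
f(x) \;\approx\; \sum_{i=1}^{N} a_i \, \sigma\!\left( w_i^{\top} x + b_i \right)
```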
Implications
What do KANs enable? First, you can learn functions with much less data thanks to the lower parameter count. That matters because our minds learn from few examples, and most scientific and business problems only have a limited amount of data. Initial testing also appears to reach better performance with fewer parameters. Second, there's the potential for faster inference, again because of the lower parameter count. Third, since these networks fit mathematical functions more exactly, there's the potential to handle more complexity with less error.
There's no such thing as a free lunch, so what's the downside? KANs are much more computationally intensive to train. If you are going for pure compute efficiency, you'll likely want to use MLPs, as KANs take longer to train. Also, KAN models have not yet been tested at scale, and it is unknown what limitations will appear in massively large models or whether the performance improvements persist. It is also unclear how well KANs deal with discontinuous functions in practice, since the Kolmogorov–Arnold representation theorem applies to continuous functions. It will be interesting to see what emergent behaviors result from using KANs.
The ideal architecture for a neural network is one where every type of parameter is learnable: the nodes, the layers, the weights, the activation functions, the connections, everything. To an extent, this ideal mimics how the brain evolves and changes. The way current neural networks are designed, all parameters are fixed except for the weights; experimentation is how the other parameters, such as the number of layers and the activation functions, get tuned. KANs move us towards more malleable neural networks. In KANs, the activation functions are learnable, as are the connections between nodes. This advances us towards a world where we need to specify less.
What comes next
Now, to put things into context, MLPs are a basic kind of neural network that have become components in larger, more advanced networks. Large language models like Claude and GPT-4 are based on the transformer architecture. Transformers have many components, but the two that repeat the most are attention blocks and MLPs. To understand where MLPs live in a transformer, see the figure below. If you want more detail on how transformers are architected, here's a great video by 3blue1brown that illustrates it. Attention blocks and MLPs alternate across repeated layers, and that stack produces much of the emergent behavior you have interacted with. Which means the first obvious place to test KANs at scale is to replace the MLP layers with KAN layers. Be on the lookout for a team to do this.
Figure 2. Where MLP components live in a transformer architecture
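To make the proposed swap concrete, here is a rough sketch of a pre-norm transformer block whose feed-forward sublayer is replaced by KAN-style layers. Nothing here comes from the paper: the block reuses the illustrative SimplifiedKANLayer from the earlier snippet, and the class name and dimensions are placeholders I chose.

```python
# Rough sketch of the experiment described above: swap the transformer's MLP
# sublayer for KAN-style layers. Assumes the SimplifiedKANLayer sketch from
# earlier in this post is in scope.
import torch
import torch.nn as nn

class KANTransformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Usual choice: nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
        #                             nn.Linear(d_hidden, d_model))
        # Swapped-in choice: two stacked KAN-style layers with the same shapes.
        self.ffn = nn.Sequential(
            SimplifiedKANLayer(d_model, d_hidden),
            SimplifiedKANLayer(d_hidden, d_model),
        )

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)         # self-attention sublayer
        x = x + attn_out                         # residual connection
        x = x + self.ffn(self.norm2(x))          # KAN layers in place of the MLP
        return x
```

Whether a single wider KAN layer or a stacked pair like this works best is exactly the kind of question the experiments below will have to answer.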
The most immediate place KANs are likely to be used is scientific computing, where they can help deduce exact equations for scientific phenomena. For most other applications, a lot of testing and experimentation will likely be needed. In the coming months, here's what I expect to happen:
People will test how large you can make a KAN while maintaining the original findings of the paper
Someone will test and plot the relationship between the number of KAN parameters and the number of MLP parameters (a back-of-the-envelope version of that comparison is sketched just after this list)
There will be many papers comparing transformer architectures using MLPs vs. KANs that attempt to outline the situations where one works better than the other
Someone will create a combined training-inference cost curve denoting the decision boundary for when using KANs vs. MLPs pays off at a given network size. Note that the cases where KANs win are likely to be more scientific problems, workloads that are much heavier on inference than training, or smaller networks.
Someone will find a more efficient way to train KANs
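For a sense of what the parameter-count comparison might quantify, here is a back-of-the-envelope sketch. The KAN count assumes roughly (grid size + spline order) coefficients per edge, which is an approximation of the paper's B-spline parameterization rather than its exact accounting; the function names and default values are my own.

```python
# Back-of-the-envelope parameter counts for the KAN-vs-MLP comparison above.
# The KAN count is an approximation: roughly (grid_size + spline_order)
# coefficients per edge, not the paper's exact accounting.
def mlp_layer_params(d_in, d_out):
    return d_in * d_out + d_out                       # weights + biases

def kan_layer_params(d_in, d_out, grid_size=5, spline_order=3):
    return d_in * d_out * (grid_size + spline_order)  # one small spline per edge

for width in (64, 256, 1024):
    mlp = mlp_layer_params(width, width)
    kan = kan_layer_params(width, width)
    print(f"width {width:5d}: MLP {mlp:>9,d} params, KAN ~{kan:>10,d} params "
          f"({kan / mlp:.1f}x)")
```

Under these assumptions a KAN layer carries several times more parameters than an equally wide linear layer, so the interesting empirical question is whether KANs can be made narrow and shallow enough to come out ahead overall.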
Come back to this post in 2-4 months to see which of the above played out. In the meantime, experiment with KANs if you have the means. Otherwise, enjoy the sneak peek of what's to come.