"All that glitters is not gold" - Shakespeare
There you are, watching an incredible demo for a product. Everything the vendor is showing you seems to solve your problem exactly. You've been struggling with this problem for years. No one has been able to crack the code, but now this vendor has! Plus, the UI is slick. They have even managed to exceed human performance. And they are doing it with AI. Is what they are selling you legit? Before you pull out your checkbook, let's discuss how to identify AI snake oil.
Limits
Over time there has been a lot of hype in the terminology used around AI. While the explosion of generative AI seems to have people putting the AI label on everything now, the same behavior was also happening 3-4 years ago. So much so that Arvind Narayanan, a Princeton professor, put together a great paper and presentation about how to recognize AI snake oil. You can also watch the presentation here. Arvind was taken aback by some of the claims that these AI products were making and decided to dig deeper. He broke things down into three main categories:
Genuine, rapid progress. Typically, these are systems involving perception.
Imperfect but improving. Typically, these are systems involved with automating judgement.
Fundamentally dubious. Typically, these are systems that predict social outcomes.
Figure 1 - Arvind Narayanan's guide for types of AI claims
The implication here is that there are areas where AI is very good, areas where improvements are being made, and areas where AI actually performs poorly. Trusting AI in the poorly performing areas is potentially dangerous: the results get over-relied on as truth even though model performance is low. In the hands of an uninformed decision maker, who does not know all of the intricacies, exceptions, and errors of a model, results are typically taken at face value. No model is perfect. In industry I’ve seen plenty of human processes with poor accuracy where each result was still taken as truth despite that poor performance.
We've made a lot of progress in just the three years or so since Arvind’s presentation came out. The improvements in perception have allowed generative AI to enter the stage of producing human-level content. Automating judgement remains in an improving state, depending on the domain and use case. However, both automating judgement and predicting social outcomes still have plenty of situations where simple models can outperform more complex ones.
Why? There’s a concept in statistics called the Cramér-Rao Lower Bound, and it describes the best performance we can expect out of an unbiased model (formally, a floor on the variance of any unbiased estimator). This bound is the theoretical limit of how good a model could be, and anything claiming to do better is likely mistaken. Some algorithms can get models closer to this limit, but typically it is easier to improve by getting better sources of data. The issue with problems in the fundamentally dubious section is that simple models already take you close to the Cramér-Rao Lower Bound, and the additional data you hope will improve performance either does not make a difference or does not exist. Hard problems remain hard, and slapping AI on them won’t necessarily solve all of your woes.
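To make that concrete, here is a toy sketch in Python (the numbers are arbitrary and it assumes NumPy is installed; this is my illustration, not from Arvind’s work): when estimating the mean of a Gaussian, the plain sample mean already sits at the Cramér-Rao floor, so a “fancier” unbiased estimator like the median can only do worse, not better.

```python
import numpy as np

# Toy illustration: estimate the mean of a Gaussian with known variance.
# The Cramer-Rao Lower Bound says no unbiased estimator can have variance
# below sigma^2 / n -- and the plain sample mean already achieves it.
rng = np.random.default_rng(42)
sigma, n, trials = 2.0, 100, 10_000
crlb = sigma**2 / n  # theoretical floor on estimator variance

samples = rng.normal(loc=5.0, scale=sigma, size=(trials, n))

simple_estimates = samples.mean(axis=1)       # simple model: sample mean
fancy_estimates = np.median(samples, axis=1)  # "fancier" alternative: sample median

print(f"CRLB (best possible variance): {crlb:.4f}")
print(f"Sample mean variance:          {simple_estimates.var():.4f}")  # roughly at the bound
print(f"Sample median variance:        {fancy_estimates.var():.4f}")   # worse, not better
```

Once a simple approach is already near the theoretical limit, extra model complexity has nowhere left to go; only better data changes the problem.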
Experts
A good way to think about AI is as an accelerant. It can take something existing and make it better/faster. AI is not a magic box that performs miracles. If the solution you are looking at is claiming to perform miracles, or that a longstanding problem is suddenly solved overnight by AI, you should be suspicious. A good heuristic for evaluating whether AI can do something well is:
If an expert human, given an adequate amount of time, could not make this decision with a high degree of accuracy, then it is unlikely that an AI will be able to either.
I stress expert here. When we look at the performance of a machine learning model vs. humans, our goal is to meet or exceed a group of experts. That is the benchmark needed to have sufficient trust that we can automate the decision making with a model. At the end of the day, most machine learning models are about automating expert decision making. In some cases, this automation simply performs at a faster pace, such as medical diagnosis, loan decisions, or finding the right web page. In other cases, the level of automation opens up completely new capabilities, such as self-driving cars, generative art, or identifying all known proteins. However, getting to these points required building on the high degree of knowledge of experts.
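If you want to make that benchmark concrete, here is a minimal sketch of what “meet or exceed a group of experts” can look like (the function and data names are hypothetical placeholders): score the model and each individual expert against the same ground truth, and compare the model to the panel’s majority vote.

```python
import numpy as np

def expert_benchmark(expert_labels, model_preds, ground_truth):
    """Compare a model against a panel of experts on the same held-out cases.

    expert_labels: (n_experts, n_cases) array of binary labels from each expert
    model_preds:   (n_cases,) array of the model's binary predictions
    ground_truth:  (n_cases,) array of the true binary outcomes
    """
    expert_labels = np.asarray(expert_labels)
    model_preds = np.asarray(model_preds)
    ground_truth = np.asarray(ground_truth)

    per_expert = (expert_labels == ground_truth).mean(axis=1)
    # Panel benchmark: majority vote across experts (ties break toward 0 here).
    majority = (expert_labels.mean(axis=0) > 0.5).astype(int)
    panel_accuracy = (majority == ground_truth).mean()
    model_accuracy = (model_preds == ground_truth).mean()

    print(f"Individual experts:  {per_expert.round(3)}")
    print(f"Expert panel (vote): {panel_accuracy:.3f}")
    print(f"Model:               {model_accuracy:.3f}")
    # Automating the decision is only defensible if the model meets or beats the panel.
    return model_accuracy >= panel_accuracy
```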
There are currently two driving forces that threaten the quality of knowledge needed to build better systems and could lead to another trough of disillusionment for AI. The first, and more immediate, is more “AI experts” flocking into the market to chase the hot “new” area. If you are buying an AI system you want it built by a true AI expert. These individuals have likely been in the field for 7-10+ years, have strong math and computer science backgrounds, and are not chasing hype. There’s a lot of subtlety and nuance to how you apply AI well, and that knowledge takes a long time to accumulate. Conversely, a recent “expert” is likely to make claims that are not actually true or miss subtleties in application, which will eventually lead to you misallocating your dollars. If there are fewer true experts available, do you have the expertise to actually assess the solution you are looking at?
The second force is the potential enshittening of knowledge caused by ChatGPT, whereby reliance on information that ranges from mediocre to outright false creates a bad feedback loop. Currently, ChatGPT performs like a mediocre human, without expert ability in many areas. In fact, people are discouraged from using it for any sort of expert advice due to its lack of sound footing. However, as mentioned earlier, if individuals are not aware of these cautions they may take the answer at face value. Think of how often you google the answer to a question and take the first result as the answer. What happens now if you can’t see additional results? Over time, a reliance on faster answers at the cost of correctness will erode the quality of knowledge we have. This will make expert knowledge even more valuable. But you need to have an understanding of what kind of expert you are dealing with, along with their limits.
Piercing the curtain
If you needed to evaluate whether or not an AI system could actually achieve its claims, how would you do it? Here’s a list of questions to help you sniff out whether the solution you’re looking at is legitimate or snake oil.
Does your problem reside in an area where AI performs well or is it in an area that is fundamentally dubious?
What’s the human benchmark for doing well? What’s the theoretical limit of performance?
If expert humans were given enough time, could they accomplish a similar degree of accuracy as this solution?
How much better is the proposed solution than a simple model? Is that difference significant?
Can you run your own examples through the system live?
Are you talking to an actual expert on this topic, or are they reselling an expert system/AI utility?
In using these questions, your goal is to figure out if the solution is plausible. The first two questions provide a reference point for how to think about performance. The next two determine whether you actually need an AI model. The last two attempt to ascertain if the system will perform as claimed.
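To make the “simple model” question concrete, here is one hypothetical way to pressure-test a vendor’s claim (a sketch only; it assumes scikit-learn, and X and y stand in for your own labeled data): cross-validate a simple baseline yourself and see whether the claimed accuracy clears it by a meaningful margin.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def beats_simple_baseline(X, y, claimed_accuracy, folds=10):
    """Check whether a claimed accuracy meaningfully beats a simple model."""
    baseline = LogisticRegression(max_iter=1000)
    scores = cross_val_score(baseline, X, y, cv=folds, scoring="accuracy")
    mean, spread = scores.mean(), 2 * scores.std()  # rough band across folds

    print(f"Simple baseline accuracy: {mean:.3f} +/- {spread:.3f}")
    print(f"Claimed accuracy:         {claimed_accuracy:.3f}")
    # If the claim falls inside the baseline's band, the "AI advantage" may not be significant.
    return claimed_accuracy > mean + spread
```

If the answer comes back False on your own data, the improvement being sold may not be worth the cost of the added complexity.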
If you have other questions or methods you’ve used to ensure something isn’t AI snake oil I’d love to hear them. Feel free to leave a comment with what you’ve used.