"Perfection is a myth"
- Roger Federer
We live with error in everything we do: moving around the house (who hasn't stubbed a toe?), saying the wrong word when speaking, throwing an errant pass or missing a shot in sports, predicting a future that never goes as planned. Perfection is an ideal because it is so rare, and it is rare because error is the norm.
Recently, Roger Federer, the famed tennis player, gave a commencement speech at Dartmouth College. During the speech he noted that he won 80% of his matches. How many points do you think he won? Federer won only 54% of all the points he played. One of the best tennis players of all time won slightly more than half of the points he played. However, it is the aggregation of that slight edge over many repetitions that allowed Federer to come out consistently on top. His results also imply that he lost 20% of his matches and 46% of his points. In Federer's own words: "perfection is a myth".
We need to understand that a similar relationship with error exists for AI and machine learning. Let's explore the error rates seen in different domains, what counts as good performance, how performance has changed over time, and who benefits or suffers from errors.
Baselining
Error is domain dependent and problem specific. There are domains where 70% accuracy is phenomenal (social sciences) and others where 70% accuracy is abysmal (physical equations). To give you a sense of the range, I gathered the top submissions from a sample of competitions on Kaggle, a well-known machine learning competition platform, and created the table below. Some of the best modelers in the world compete on Kaggle for prize money typically ranging from $10,000 to $100,000. The table contains the name of each competition, the performance metric and its description (there are many ways to assess performance besides accuracy), and the best score.
As you look through the table, take note that no top score is perfect. Even with some of the best modelers in the world competing, every model contains error. Part of that is based on the data available, part is due to the nature of the problem, part is due to randomness, part is due to tradeoffs, and part is due to methods of assessment. Humans aren't perfect, and neither is AI or machine learning. That's an important concept to come to terms with. Even if AI can write or code or model better than you, better does not mean perfect. AI will still make errors, just as humans do. Better also does not imply better in all areas. Decision surfaces have weaknesses and tradeoffs. A weaker overall method may be stronger in a specific area that the "better" model has chosen to degrade for higher overall performance.
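One practical way to ground a score is to compare it against a naive baseline before celebrating or despairing. The sketch below is a minimal illustration of that habit, assuming scikit-learn and a synthetic, imbalanced dataset of my own invention (nothing here comes from the competitions above): a headline accuracy only means something relative to what a do-nothing model already achieves.

```python
# Minimal sketch: judge a model's accuracy against a naive baseline.
# The dataset is synthetic and illustrative; assumes scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem: roughly 80% of cases belong to one class.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Naive baseline: always predict the most common class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# A simple "real" model for comparison.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("Model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
# An accuracy of ~80% sounds impressive until you notice the do-nothing
# baseline already scores ~80% on this imbalanced data.
```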
The Impact of Error
Once you understand that error will happen, the next objective is to understand the impact of an error. Not all errors are created equal, both across models and within a model. Getting a medical diagnosis wrong is much, much worse than recommending the wrong TV show. Within the diagnosis of a particular disease, say breast cancer, a false negative (where the model says someone does not have cancer when in fact they do) can be substantially worse than a false positive (where the model says someone has cancer when in fact they do not). The latter results in follow-up tests to confirm, whereas the former can allow the disease to progress and become worse.
While the above is a qualitative understanding, in practice costs are typically assigned to the different types of errors. This creates a cost-threshold curve that allows the creators of a model to control which costs they are willing to accept and the tradeoffs between different rates of error. That is, the same base model can behave differently simply by modifying the threshold at which a decision is made.
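To make the threshold idea concrete, here is a minimal sketch in Python with made-up costs (a false negative priced at 20 times a false positive, purely for illustration). Sweeping a decision threshold over the same set of model scores changes which errors occur and what they cost; the "best" threshold is whichever one minimizes the total cost you have chosen to care about.

```python
# Sketch of a cost-threshold tradeoff: same model scores, different thresholds,
# very different error profiles. The costs below are illustrative, not real.
import numpy as np

COST_FP = 1.0    # cost of a false positive (e.g., an unnecessary follow-up test)
COST_FN = 20.0   # cost of a false negative (e.g., a missed diagnosis)

def total_cost(y_true, scores, threshold):
    """Total cost when scores >= threshold are classified as positive."""
    preds = scores >= threshold
    false_positives = np.sum(preds & (y_true == 0))
    false_negatives = np.sum(~preds & (y_true == 1))
    return COST_FP * false_positives + COST_FN * false_negatives

# Toy data: model scores for ten cases and their true labels.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.11, 0.22, 0.33, 0.41, 0.62, 0.42, 0.72, 0.81, 0.95])

thresholds = np.linspace(0.0, 1.0, 21)
costs = [total_cost(y_true, scores, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"lowest-cost threshold: {best:.2f}, cost: {min(costs):.0f}")
# With false negatives priced at 20x false positives, the lowest-cost
# threshold lands well below 0.5: the model flags extra cases for review
# rather than risk missing a true positive.
```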
It is important to understand who benefits and who suffers from model error. There are three main parties involved in the use of any model: the creator, the user, and the consumer, and the same person can hold more than one of these roles. The creator develops the model. The user uses the model to make predictions or inferences. The consumer is impacted by the outputs of the model. For instance, in the breast cancer example above, the creator is a medical device company that developed a model for assessing the cancer, the user of the model is the doctor, and the patient is the consumer of the outputs. Alternatively, for a streaming service recommendation algorithm, such as on Netflix or Spotify, the company both creates and uses the model to predict what movies or songs a person will like, and the watcher or listener is the model consumer.
When assessing the impact of error, it is important to understand how the costs differ across these roles. A model creator is likely looking at aggregate costs and impacts, for instance, trying to make the best model at predicting breast cancer. A user of the model may or may not bear direct costs of error; a doctor has little downside from using a model to the extent that most other doctors use a similar model. The consumer of the model, the patient, suffers the most from errors. Patients have maximum exposure to errors, which is why they need to advocate heavily for themselves and trigger secondary processes to reduce false positives and false negatives. That is a very different impact than a poor Netflix suggestion.
There are also cases where one can benefit from model error. Those areas tend to be in exploration and combination, such as idea generation, or in the granting of access, such as obtaining a loan. Still others become indirect beneficiaries or victims of model usage. Mark Spitznagel, the founder of the Universa hedge fund, has made a career from exploiting the errors in models used by other hedge funds. Alternatively, society as a whole has suffered from the widespread use of social media ranking algorithms, whether or not any individual uses those systems themselves.
Mitigations
To reiterate, all AI and machine learning models have error across their outputs. Quantifying and understanding the performance of a model is central to determining how it can be used. Since a model does not exist in a vacuum, one needs to build processes and mitigations around its usage to deal with the errors that will eventually arise.
PromptWerk has developed an incredible automated software engine and is the first VC-backed company where all the code used by the company is 100% generated by AI. However, its process is not without error. In fact, about 50% of its AI's coding attempts do not result in correct code. The impact of this error is small since the AI can be triggered to try again. The goal is for 100% of coding attempts to succeed, but until then, PromptWerk needs several methods to deal with errors. Typically, these methods are called mitigations, and they are fundamental to dealing with error-prone processes.
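PromptWerk's internal tooling is not public, so the sketch below is only a generic illustration of the retry pattern; generate_code and passes_tests are hypothetical stand-ins for an AI code generator and its validation step. The point is that a roughly 50% per-attempt success rate compounds into a much higher overall success rate after a few bounded retries, with a fallback for the cases that still fail.

```python
# Generic sketch of a retry mitigation around an error-prone generator.
# generate_code and passes_tests are hypothetical placeholders, not
# PromptWerk's actual system.
import random

def generate_code(task: str) -> str:
    """Stand-in for an AI code generator that succeeds roughly half the time."""
    return "good code" if random.random() < 0.5 else "broken code"

def passes_tests(code: str) -> bool:
    """Stand-in for the validation step (unit tests, linters, etc.)."""
    return code == "good code"

def generate_with_retries(task: str, max_attempts: int = 5) -> str | None:
    """Retry until validation passes or the attempt budget runs out."""
    for _ in range(max_attempts):
        code = generate_code(task)
        if passes_tests(code):
            return code
    return None  # escalate to a human or another fallback process

# With a ~50% per-attempt success rate, five attempts all fail only
# (0.5)**5 ~ 3% of the time; those residual failures still need a
# human-in-the-loop or other fallback.
```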
Implementing AI systems requires both active and passive mitigations to be successful, similar to what PromptWerk has done. When considering mitigations, the following strategies can be employed:
Developing second looks: In times of uncertainty, we ask for a second opinion. One mitigation is to kick off secondary predictions or processes when a model is uncertain.
Sensitivity analysis: Many models are highly sensitive to their inputs. One can gauge how sensitive a model is by varying the inputs slightly and observing how the output changes (a short sketch follows this list). Understanding model sensitivity is helpful when designing processes around highly sensitive points or areas.
Create multi-step processes: In medical testing, a positive result is typically followed by additional confirmatory tests, depending on the severity of the condition.
Fallback methods: When models fail or people are prone to errors, a typical fallback method is having a human-in-the-loop to review and correct the outputs. This ensures there is always a safety net for critical applications.
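As a rough illustration of the sensitivity analysis strategy above, the sketch below nudges one input feature at a time and records how much a model's predicted probability moves. The model, data, and perturbation size are illustrative assumptions, not a prescribed recipe.

```python
# Rough sketch of a sensitivity check: nudge one input at a time and
# watch how much the model's output moves. Model and data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def sensitivity(model, x, epsilon=0.05):
    """Change in predicted probability from a small nudge to each feature."""
    base = model.predict_proba(x.reshape(1, -1))[0, 1]
    deltas = []
    for i in range(len(x)):
        x_nudged = x.copy()
        x_nudged[i] += epsilon
        nudged = model.predict_proba(x_nudged.reshape(1, -1))[0, 1]
        deltas.append(nudged - base)
    return np.array(deltas)

# Inspect one case: large deltas flag inputs where the prediction is fragile
# and where a second look or confirmatory step may be worth triggering.
print(sensitivity(model, X[0]))
```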
These mitigations help in managing the risks associated with deploying AI and machine learning models, ensuring that the systems remain robust, reliable, and trustworthy even in the face of inherent uncertainties and errors.