“No amount of source-level verification or scrutiny will protect you from using untrusted code”
-Ken Thompson
Every day we make hundreds of decisions. When working in teams, we put safeguards in place for each other to make sure we get good outcomes and prevent mistakes. So when we move to automate our decision-making process with machine learning, it becomes scary. It’s scary because we’re no longer directly in control, and there is an innate fear that machines won’t be able to account for every nuance we see. The tradeoff we are making is increased capacity, or an entirely new capability, in exchange for less direct control. This works surprisingly well in many cases, such as models that outperform radiologists at detecting certain diseases.
Alternatively, there are situations where people simply can’t react fast enough. While this example doesn’t require machine learning, think about a car crash. From the time a crash starts to when the airbag is fully deployed is 60-80 milliseconds. This includes the time it takes the computer to detect that a crash is occurring, decide whether the airbag should be deployed, and for the chemical reaction that inflates the airbag to complete. The average human reaction time is 150-300 milliseconds, meaning that in the worst-case scenario, where there is no warning, your car can protect you before you even realize what’s happening. This example shows only a 2-5x improvement, but some machine learning systems can make judgements 100-1000x faster than it would take a typical person to make the same decision.
Verifying Machine Learning
Machine learning is ultimately about automating a decision-making process, which frees us up to focus on higher-level tasks or to build more sophisticated systems. However, in order to deploy these systems into the wild, we need to trust that they will work effectively. In machine learning we use multiple methods to verify models, and they reduce to two categories: verification at the time of creation and verification after deployment. During model creation, typically called “training”, we set up a series of tests for the model, which includes the creation of the following (a small sketch of the split follows the list):
Training data set: Data that the initial formulation of the model will learn from.
Validation data set: Data that is held out from the training set to provide an unbiased evaluation of the model while identifying the optimal model hyperparameters.
Test data set: Data on which the “learned” state of the model is evaluated to estimate the level of performance we expect to see in production.
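A minimal sketch of this three-way split, assuming a classification problem and using scikit-learn; the synthetic data, split ratios, and random seeds are illustrative choices, not requirements:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (illustrative only).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)

# Hold out a test set (20% here) that is touched only once, at the end,
# to estimate the performance we expect to see in production.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Split the remainder into training and validation sets; the validation set
# guides hyperparameter tuning without ever touching the test set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)
```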
These steps won’t come as a surprise to any machine learning practitioner, since they are standard practice. Why? Our goal with these three steps is to gain confidence that the model’s performance can be trusted and to verify that we have modeled the process correctly. While there is a lot of art and technique in doing these three steps well, the basic premise is to progress through each set until we reach a final model that performs consistently on all of them. It is important that these datasets are constructed so as not to leak information that wouldn’t be available at prediction time (we don’t want to hand out the answer key before the test) and that the data distributions mimic what we expect to see in the world. Why do data distributions need to mimic the real world? Because models base their inferences and predictions on the relationships they find in a given data distribution. That is also why we continue to monitor the performance of models after they’ve been deployed.
The real world is messy. The behavior of people changes over time. People adapt to new systems. Feedback loops make it hard to account for how the changes you make will affect a system. That’s why we monitor machine learning models after they have been deployed and are in use: we need to continually verify that they are performing as intended and are still useful for the current state of the world. There are myriad ways to do this, but they mostly come down to two types of verification:
Verifying that the model performance is similar to what was seen during model training
Verifying that the data distributions the deployed models use for prediction are similar to what the models were trained on
Models are built on a snapshot of the past and are intended to predict similar effects in the future. Since our modeling process may not have accounted for every aspect of behavior, and because behaviors change, we need to continually assess the performance of our model after it has been deployed. Our goal is to see the same performance in production as we did during training; if there are large deviations, it means something went wrong.
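A minimal sketch of this kind of check, assuming we recorded a baseline metric at training time and have a recent window of production predictions with ground-truth labels; the metric and tolerance threshold are illustrative assumptions:

```python
from sklearn.metrics import accuracy_score

def performance_drift_alert(y_true, y_pred, baseline_accuracy, tolerance=0.05):
    """Flag the model if production accuracy falls too far below the training baseline.

    The 0.05 tolerance is an illustrative allowance; real systems tune this per use case.
    """
    production_accuracy = accuracy_score(y_true, y_pred)
    drop = baseline_accuracy - production_accuracy
    return {
        "production_accuracy": production_accuracy,
        "drop_from_baseline": drop,
        "alert": drop > tolerance,
    }

# Example: a 0.92 baseline from the test set and a small window of production labels.
report = performance_drift_alert(
    y_true=[1, 0, 1, 1, 0, 1], y_pred=[1, 0, 0, 1, 0, 0], baseline_accuracy=0.92
)
print(report)
```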
If the data distributions are different from what the models were trained on, we should not expect the performance to be the same, nor should we expect the models to be accurate. A stark example: suppose we created a model that predicted the next page a visitor would click on as they navigated a site. If we redesign the website so that none of the pages keep their names and the navigation hierarchy is completely different, the predictive model becomes useless; the statistical distribution of how people navigated the old site versus the new site is completely different. In practice, most examples are not as extreme as this, but subtle distribution shifts can have big impacts, especially when a 1% drop in performance can equate to millions of dollars in lost revenue. Overall, these verification processes work well when we can easily quantify performance.
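One common way to quantify this kind of shift on a numeric feature is a two-sample test between training data and recent production data. Below is a minimal sketch using SciPy’s Kolmogorov-Smirnov test; the synthetic data and the 0.05 significance threshold are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values seen at training time vs. values arriving in production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.3, scale=1.1, size=5_000)  # deliberately shifted

# The Kolmogorov-Smirnov statistic is the largest gap between the two empirical
# distributions; a small p-value suggests the distributions differ.
result = ks_2samp(train_feature, prod_feature)
drift_detected = result.pvalue < 0.05  # illustrative threshold

print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}, drift={drift_detected}")
```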
Verifying Artificial Intelligence
The issue with current AI systems is that we have difficulty automatically verifying whether the outputs they produce are what we intended. Part of this is building alignment into AI systems, but the other part is having sufficient verification processes and algorithms in place to check the outputs. LLMs and generative AI work by finding information in a compressed space and then decompressing that information into something coherent. The issue is that there are many valid options when you decompress information. Similar to how we apply verification processes to classic machine learning, we are trying to verify that what was created and validated by the machine is also valid to a person.
We use verification processes all the time at work. People have their work verified or reviewed by their manager before it is officially released. It should be no surprise that we need similar systems in place to verify AI. However, it is unlikely that all of these processes can be human-in-the-loop due to the overwhelming speed or scale that is required to match the output of AI systems. Verification becomes even more important as we look to build out automated processes and agents that handle events without human intervention. Some key areas we want to verify are:
Fact Verification: It is well known that current AI models can hallucinate information, whether that is getting small details wrong from a given set of facts or making up entire court cases or websites.
Coherency: The output should be logical and follow certain guidelines, and the output structure should be such that it can be readily absorbed by another system.
Consistency: If you ask the model to answer a question ten times, does the model return the exact same answer ten times? If not, what changes? Are these minor things or major issues? How are you quantifying the variance of the output along with its impact? A sketch of one such check follows this list.
Quality of Output: Does the output meet certain quality standards set by the organization? This could be anything from matching an organization’s tone and voice, to following its brand guidelines, to making sure all the images on a website contain pictures of cute kittens.
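As an illustration of the consistency point above, here is a minimal sketch that samples the same prompt several times and quantifies how much the answers vary. The generate function is a hypothetical placeholder for whatever model client you use, and pairwise string similarity via difflib is just one simple proxy for variance:

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to an LLM; swap in your own client."""
    raise NotImplementedError

def consistency_report(prompt: str, n_samples: int = 10) -> dict:
    """Sample the model repeatedly and measure how similar the answers are to each other."""
    answers = [generate(prompt) for _ in range(n_samples)]
    similarities = [
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(answers, 2)
    ]
    return {
        "unique_answers": len(set(answers)),
        "mean_pairwise_similarity": mean(similarities),
        "min_pairwise_similarity": min(similarities),
    }
```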
Ideally, there are post-processing steps that examine the outputs to ensure adherence to each of these points. If an output does not meet the requirements, it should either be sent to a separate process for adjustment or routed back to the original model to regenerate the response with modifications. This becomes a large exercise in proper system engineering.
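A minimal sketch of that routing logic, assuming hypothetical generate and passes_checks functions that wrap the model call and the fact, coherency, consistency, and quality checks described above; the retry limit is an arbitrary illustrative choice:

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for the model call."""
    raise NotImplementedError

def passes_checks(output: str) -> tuple[bool, str]:
    """Hypothetical placeholder that runs the fact, coherency, consistency,
    and quality checks and returns (ok, feedback)."""
    raise NotImplementedError

def generate_with_verification(prompt: str, max_retries: int = 3) -> str:
    """Regenerate with verifier feedback until the checks pass or retries run out."""
    current_prompt = prompt
    for _ in range(max_retries):
        output = generate(current_prompt)
        ok, feedback = passes_checks(output)
        if ok:
            return output
        # Feed the verifier's feedback back into the prompt and try again.
        current_prompt = f"{prompt}\n\nRevise the previous answer. Issues found: {feedback}"
    raise RuntimeError("Output failed verification after maximum retries; escalate to a human.")
```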
As we begin to automate more creative processes with AI algorithms, verification becomes even more important. The scenario everyone is trying to avoid is that of bad models in a high-frequency trading (HFT) system: in HFT, if your models are bad and their outputs unverified, you can lose hundreds of millions of dollars in minutes. While most people using generative AI will not be at risk of that amount of impact at that speed, the warning still stands as more systems become automated, and the same risk management tools need to be applied. Those who are able to build quality verification systems will be the ones who capture the most value.