“A belief is essentially a point of view that we hold to be true.”
― Thomas E. Kida
Imagine you're a racer (running, cycling, F1, etc.) and you and your team have spent years steadily improving performance at roughly 1% a year. These improvements have been forged through intense competition, with many of your competitors making similar gains. Then someone comes along claiming they can improve your performance by 10% through some new technology. Would you believe them? What about 20%? 50%? 100%+? What would it take for you to believe them? That's the challenge machine learning practitioners face when implementing new models that perform at a much higher level than what organizations are used to.
The pace at which a machine learning system becomes successful is controlled by stakeholders' belief that the system will be impactful. Inverting that statement: a lack of belief will slow the acceptance and final implementation of a machine learning system. As a machine learning practitioner, you need to be prepared for these challenges and have tools in place to help establish belief. To help make your implementation a timely success, let's talk through how to make stakeholders believe.
The Why
Machine learning is ultimately about automating decision making. This automation provides an organization with a way to level up and do things that weren't feasible before, enabling its people to focus on higher-value work. Machine learning systems are typically implemented to improve one of the following:
Enable decisions to be made at a scale and speed that are typically not humanly possible (10-1,000,000 decisions per second)
Improve overall decision making quality and consistency
Lower the overall cost of labor
Because of this power, machine learning systems are typically used to replace a critical process within an organization. However, this all comes with risk. If the system is not implemented well (or even if it is), there is a possibility that an error or failure brings the organization to a halt, or that the system vastly underperforms over time. This potential downside causes organizations to be wary and to find ways to manage their risk. At the end of the day, it is hard to hold a machine accountable. Therefore, organizations will use several layers of risk management and gating to ensure that a system is performing as expected.
The Mind of a Stakeholder
Stakeholders are trying to manage risk as they upgrade their existing processes and systems. New systems mean new potential points of failure with which they are unfamiliar. In an ideal world, they get a better, faster system without any negative changes in how they have been doing things. Obviously, that's impossible: partly because nothing works perfectly right out of the gate, and partly because stakeholders typically don't understand the full implications and second-order effects of implementing a machine learning system. For instance, when you can make decisions at little to no cost, you will make many more decisions than you did before, as Ajay Agrawal and his co-authors discuss in Prediction Machines.
Stakeholders will simultaneously believe and disbelieve results as a machine learning system goes through testing. They believe because they initially advocated for the system and need to signal that they are making quality decisions. They disbelieve results for two reasons. The first is that they need the system to work and do not want the implementation to fail, so they put up gates of disbelief to ensure that it works properly. The second is that the scale of the results is typically beyond what their minds think is possible; not because they don't believe the data (although this does happen frequently), but because they have worked in an area for so long that they have an intuitive understanding of what should be possible. That mindset, however, tends to be based on a view of the world that does not include automation or the ability to incorporate more information than the human mind can process. Additionally, there have been many machine learning efforts that did not live up to the hype, typically due to poor implementation or problem formulation. All of this makes stakeholders worried, and rightfully so. Your job as a machine learning practitioner is to help manage the organization's risk around the implementation.
Managing Risk
Risk management is deeply embedded in the machine learning process through the ways models are validated. However, machine learning practitioners do not always think about their validation methods in the context of managing risk. To have a successful deployment, you need to proactively have conversations with stakeholders about how to manage the risk of the implementation. The goal is to continually show stakeholders that they can believe the performance of the system by understanding their fears up front and co-creating testing frameworks with their buy-in before the project starts. By understanding ahead of time what stakeholders need to see in order to believe performance, you can prevent arguments later about whether something works. Let's talk through the stages of a machine learning project and how results are proven at each:
Modeling
Sandbox testing
Live testing
Continuous monitoring
Modeling
Validation in the modeling phase should be the most natural to machine learning practitioners. A key part of machine learning is setting up a modeling methodology that ensures you can trust your results. This is typically done through cross-validation, and the preferred schemes have shifted over time based on the size of the data being used. The main question of this phase is
How can we believe that the results of this model will persist when it goes live?
In this phase, stakeholders typically do not see ongoing results as models are being developed. However, it is important to align on the metrics used to assess models ahead of time. In an ideal world, the metric used to assess the performance of a model is the same one that will be used to measure the impact of the system, although this is not usually the case. For instance, you might want to measure the revenue improvement from picking loans that are less likely to default, but most machine learning classifiers are easier to evaluate on how accurately they predict whether a loan will default. The difference between default accuracy and revenue improvement creates a mismatch in how the system is judged. To get stakeholders on board, you need to talk about the differences and attempt to map the metric used to assess the model to the final metric used to judge the impact of the system. How should you align with stakeholders at this early stage?
Use the right metric - the metric should be as close to the final measure of performance as possible but typically proxies are needed.
Establish benchmarks - you should have an understanding of where the organizational performance currently is and how similar systems tend to perform.
Be cautious of very high performance early on - very good performance out of the gate typically means you have an error somewhere or the problem is very simple.
Understand where error is occurring - no model is perfect. The better you can describe to a stakeholder where a model does well and where it does poorly, the more they will trust your implementation and can create systems to manage deficiencies.
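To make the points about cross-validation and proxy metrics concrete, here is a minimal sketch in Python with scikit-learn. It cross-validates a default-prediction model on a proxy metric (ROC AUC) and then translates the out-of-fold predictions into a rough revenue estimate. The dataset is synthetic and the loan economics (interest earned, principal lost, approval threshold) are assumptions for illustration only, not a recommended evaluation design.

```python
# Minimal sketch: cross-validate a default-prediction model on a proxy metric
# (ROC AUC) and translate predictions into an illustrative revenue estimate.
# The data, revenue assumptions, and threshold are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

# Hypothetical loan data: 1 = default, 0 = repaid
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=0)

model = GradientBoostingClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities give an honest preview of how the model may perform live
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
print(f"Proxy metric (ROC AUC): {roc_auc_score(y, proba):.3f}")

# Map the proxy metric to a business metric under assumed economics:
# approve loans below a risk threshold; earn interest on repaid loans, lose principal on defaults.
INTEREST, PRINCIPAL, THRESHOLD = 1_500, 10_000, 0.5  # assumed values
approved = proba < THRESHOLD
revenue = np.where(y[approved] == 0, INTEREST, -PRINCIPAL).sum()
print(f"Estimated revenue under assumed economics: ${revenue:,.0f}")
```

Walking stakeholders through a mapping like this, even a crude one, ties the modeling metric they will hear about during development to the business outcome they will ultimately judge the system by.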
Sandbox Testing
In this phase, a machine learning system is set up to mimic a live system without being exposed to real-world actions. This is an intermediate stage to smooth over any discrepancies between how the process was modeled and how it might perform when it goes live. Live data might be simulated, or previously unseen data may be used. It is not only the model that needs to be tested, but also the infrastructure around it. A very accurate model might be great, but if it takes an hour to return a prediction, it may be unusable. Ideally, you know what the broader system requirements are up front (beyond model performance). The question you should ask stakeholders is
How should the system be tested and how well does it need to perform for you to believe that it will be impactful?
This is a key step, as you want stakeholders to establish what they need to see in terms of performance for them to consider the system successful. Getting this buy-in early is crucial. I've seen many stakeholders move the goalposts or introduce a lot of scope creep once they start to see how well a system performs. Your goal is to get them to define what good performance looks like before their judgment is anchored by early results.
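As one way to structure that conversation, here is a minimal sketch of a sandbox latency check: simulated requests are replayed one at a time through a stand-in model and the latency percentiles are compared against an assumed per-prediction budget. The model, the request source, and the 200 ms budget are all illustrative assumptions, not real requirements.

```python
# Minimal sketch of a sandbox check: replay simulated requests through the model
# and verify latency against an agreed requirement. The model, request source,
# and the 200 ms budget are illustrative assumptions.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)  # stand-in for the real model

LATENCY_BUDGET_MS = 200  # assumed per-prediction requirement agreed with stakeholders
latencies = []
for request in X[:500]:  # replay a sample of simulated live requests one at a time
    start = time.perf_counter()
    model.predict_proba(request.reshape(1, -1))
    latencies.append((time.perf_counter() - start) * 1000)

p50, p95 = np.percentile(latencies, [50, 95])
print(f"p50={p50:.2f} ms, p95={p95:.2f} ms, budget={LATENCY_BUDGET_MS} ms")
print("PASS" if p95 <= LATENCY_BUDGET_MS else "FAIL: renegotiate or optimize before going live")
```

The specific checks matter less than the fact that the pass/fail bar was written down with stakeholders before anyone saw the numbers.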
Live Testing
Live testing is the phase where the model hits the real world. There’s a graveyard of models that have failed at this point and stakeholders are right to be worried. The question you should be asking stakeholders at this point is
How much are you willing to risk in order to identify a measurable impact?
This could be in terms of website traffic diverted, the portion of checks to re-run manually, the ad budget to allocate, or the monetary impact of wrong predictions. The goal is to risk the minimum amount required to demonstrate the desired impact. Once that milestone is established and achieved, you can begin to ratchet up the use of the model in stages. The number, duration, and size of the stages will be based on how risk tolerant the stakeholder is and the potential downside if something goes wrong. If all goes well, the system is fully implemented and you can move to the next phase.
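Below is a minimal sketch of one way such a staged ramp might be wired up: a deterministic hash of a request ID routes a growing share of traffic to the new model. The stage sizes and the request IDs are assumptions purely for illustration; the actual schedule would be set with stakeholders up front.

```python
# Minimal sketch of a staged rollout: deterministically route a growing share of
# traffic to the new model using a hash of the request ID. Stage sizes are
# illustrative and would be agreed with stakeholders in advance.
import hashlib

ROLLOUT_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]  # assumed ramp schedule

def use_new_model(request_id: str, stage: int) -> bool:
    """Route a stable fraction of traffic to the new model for the current stage."""
    fraction = ROLLOUT_STAGES[stage]
    # Hash the ID so the same request/customer is always routed consistently
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < fraction * 10_000

# Example: at stage 1 (5% of traffic), check where a few requests are routed
for rid in ["req-1001", "req-1002", "req-1003"]:
    route = "new model" if use_new_model(rid, stage=1) else "existing process"
    print(f"{rid} -> {route}")
```

Hashing rather than random sampling keeps routing consistent between stages, which makes it easier to attribute any measured impact to the new model rather than to churn in who was exposed to it.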
Continuous Monitoring
At this point the live testing was a success and the machine learning system has been able to prove itself. However, the work is not done. While stakeholders will likely be pleased, they want to avoid one-hit wonders. What stakeholders really care about is the system achieving consistent, dependable results over time. This naturally leads to the main question of this phase, which is
How consistent do results need to be and how impactful should they be in order to judge the system as a success?
At this point you want to agree on two measures. The first should be the total impact of the system. The second should be the consistency of that impact. Ideally you establish these with stakeholders before they see multiple results, so that the results do not skew how the measures are set up. It is also wise to set up thresholds that trigger a review if system performance suddenly drops. If you're successful here, congratulations! After this point, an organization will start to change how it operates and optimizes itself by depending on the machine learning system you have implemented.
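As an illustration, here is a minimal sketch of a monitoring check that tracks a rolling impact measure against an agreed threshold and flags a review when it dips. The window length, threshold, and the sample history are assumptions for demonstration, not recommended values.

```python
# Minimal sketch of continuous monitoring: track a rolling impact metric and flag
# a review when it drops below an agreed threshold. The metric, window, and
# threshold values are illustrative assumptions.
from collections import deque

IMPACT_THRESHOLD = 0.90  # assumed: review if rolling impact falls below 90% of target
WINDOW = 7               # assumed: rolling window of 7 daily measurements

def check_impact(daily_impact_vs_target, window=WINDOW, threshold=IMPACT_THRESHOLD):
    """Yield an alert for any day the rolling average dips below the threshold."""
    recent = deque(maxlen=window)
    for day, value in enumerate(daily_impact_vs_target, start=1):
        recent.append(value)
        rolling = sum(recent) / len(recent)
        if rolling < threshold:
            yield f"Day {day}: rolling impact {rolling:.2f} below {threshold}, trigger review"

# Example: impact relative to the agreed target drifts downward over two weeks
history = [1.05, 1.02, 0.98, 1.01, 0.97, 0.95, 0.92, 0.90, 0.88, 0.85, 0.84, 0.83, 0.82, 0.80]
for alert in check_impact(history):
    print(alert)
```

Automating the alert is less important than the fact that the threshold itself was agreed with stakeholders before the system's long-run behavior was known.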
Achieving Belief
There are a lot of obstacles and resistance along the way to implementing a machine learning system well. Some of them are justified and some are not, but you need to address all of them. When stakeholders believe, through understanding, that a machine learning system is working well, the implementation will be successful. Achieving that ultimately comes down to managing risk: co-creating the testing frameworks with stakeholders and getting their buy-in before proceeding down the path of implementation. The risk management framework and the decisions for success should ultimately answer the following questions:
How can we believe that the results of this model will persist when it goes live?
How should the system be tested and how well does it need to perform for you to believe that it will be impactful?
How much are you willing to risk in order to identify a measurable impact?
How consistent do results need to be and how impactful should they be in order to judge the system as a success?
If your implementation can answer all of those questions and your machine learning system performs at the required levels, you will have achieved success, and your stakeholders will believe it as well.