If you can keep your head when all about you are losing theirs…
- Rudyard Kipling
Infrastructure failures are always fascinating. There’s something surreal about experiencing the moment a critical component fails, especially one many people didn’t realize they were depending on. I spent the early part of my career investigating why buildings break and how to fix them. Failures happen much more often than you would think. Nothing can be built to perfection. Most engineering works by breaking things to figure out how strong they are, how they work, and how to make them better. There’s a lot of testing and understanding of how components and systems interrelate. A key takeaway from that time of my life is that while prevention may seem more expensive up front, it always costs less than recovering from a failure. The other lesson is that most people are vastly underprepared for when infrastructure fails, which is why engineers and planners build a lot of contingency plans and risk mitigation systems.
Let’s discuss how these systemic failures should impact the way you think and plan. Most people don’t have contingency plans for what happens if the underlying infrastructure they rely on fails. These risks also lurk under the surface for the AI systems being built today. To give context, let’s start with a current example - Silicon Valley Bank (SVB).
SVB
As you might have heard, Silicon Valley Bank (SVB) made some poor moves and was closed down by the FDIC this past Friday after many startups and VCs tried to pull their money out of the bank on Thursday, with about a quarter of all deposits being withdrawn. VCs and startups with high exposure to SVB were at serious risk of insolvency or of missing payroll until regulators stepped in to ensure no depositors lost money. SVB’s assets will be sold off through an auction. While it looks like everything will turn out alright for depositors, there was a lot of stress for those affected for about 72 hours. Thankfully, regulators were able to provide fixes quickly, but only because of many years spent working through prior failures in the financial system. How did this event start? The bank run began when many high-profile firms advised their founders to secure their future by pulling their money:
Founders Fund and other firms reportedly advised their portfolio companies earlier today to withdraw their money. Even VCs expressing support for the bank must have been doing the same privately, lest their portfolio companies risk losing their precious capital.
and
Garry Tan, the president of Y Combinator, reportedly sent an internal message to the many founders in the program today, reminding them the FDIC only insures deposits up to $250,000 in case they wanted to get their money out of Silicon Valley Bank: “We have no specific knowledge of what’s happening at SVB. But anytime you hear problems of solvency in any bank, and it can be deemed credible, you should take it seriously and prioritize the interests of your startup by not exposing yourself to more than $250K of exposure there. As always, your startup dies when you run out of money for whatever reason.”
The whole event was really caused by a poor understanding of risk by SVB and poor portfolio management of their funds. Matt Levine writes:
People kept flinging money at SVB’s customers, and they kept depositing it at SVB. Perfectly reasonable banking service.
But the customers didn’t need loans, in part because equity investors kept giving them trucks full of cash and in part because young tech startups tend not to have the fixed assets or recurring cash flows that make for good corporate borrowers. Oh, there is some tech-industry-adjacent lending you can do. Tech founders want to buy houses, and you can give them mortgages. Venture capital and private equity funds want to manage liquidity and/or juice their reported return rates by paying for investments with borrowed money rather than drawing from their limited partners, so you can get into the capital-call-line-of-credit business.
...
But there is a basic imbalance. Customer money keeps coming in, as deposits, but it doesn’t go out, as loans.
So you have all this customer cash, and you need to do something with it. Keeping it in, like, Fed reserves, or Treasury bills, in 2021, was not a great choice; that stuff paid basically no interest, and you want to make money. So you’d buy longer-dated, but also very safe, securities, things like Treasury bonds and agency mortgage-backed securities.
...
The result of this is that, as the Bank of Startups, you were unusually exposed to interest-rate risk. Most banks, when interest rates go up, have to pay more interest on deposits, but get paid more interest on their loans, and end up profiting from rising interest rates. But you, as the Bank of Startups, own a lot of long-duration bonds, and their market value goes down as rates go up. Every bank has some mix of this — every bank borrows short to lend long; that’s what banking is — but many banks end up a bit more balanced than the Bank of Startups.
...
But there is another, subtler, more dangerous exposure to interest rates: You are the Bank of Startups, and startups are a low-interest-rate phenomenon. When interest rates are low everywhere, a dollar in 20 years is about as good as a dollar today, so a startup whose business model is “we will lose money for a decade building artificial intelligence, and then rake in lots of money in the far future” sounds pretty good. When interest rates are higher, a dollar today is better than a dollar tomorrow, so investors want cash flows. When interest rates were low for a long time, and suddenly become high, all the money that was rushing to your customers is suddenly cut off. Your clients who were “obtaining liquidity through liquidity events, such as IPOs, secondary offerings, SPAC fundraising, venture capital investments, acquisitions and other fundraising activities” stop doing that. Your customers keep taking money out of the bank to pay rent and salaries, but they stop depositing new money.
...
As Armstrong puts it, SVB had “a double sensitivity to higher interest rates. On the asset side of the balance sheet, higher rates decrease the value of those long-term debt securities. On the liability side, higher rates mean less money shoved at tech, and as such, a lower supply of cheap deposit funding.”
How did this happen so suddenly and take people so by surprise? Trust, misunderstanding risk, and good/bad asymmetry. Many people trusted a reputable bank with their funds and didn’t think much about diversifying across multiple banks. That trust was built on never having experienced, or even been aware of, a possible failure event. As a result, users of the bank misunderstood both the risk of relying on a single bank and the amount of risk the bank itself was actually carrying. Lastly, and this seems like a universal law, when bad things happen they tend to happen very quickly, whereas good things take much longer to build.
Why Failures Surprise
I have a lot of friends who work in the core infrastructure (electrical grids, energy production and transport, wastewater, buildings, financial systems, etc.) that allows the world to run. The great part about this infrastructure is that it helps us do more by saving us time or giving us capabilities we didn't have before. However, because that infrastructure allows us to do more, we take mindshare away from worrying about the risk of keeping it operational. Our economy pays people to worry about those risks and figure out how to keep those systems running for us. But because the mindshare is now concentrated among the few, most people think everything is fine until it is not, and they find themselves vastly unprepared.
Why do these failures surprise us? Because we've put a lot of trust into having a system just work. No piece of infrastructure, whether a bridge, a cloud server, or a payment system, is built to be perfect and run without interruption. They all have a service life, and engineers work tirelessly to figure out how to have these systems fail gracefully, along with contingency plans. However, most users of these systems are unaware of the various failure modes and their implications. If you have a beat-up car running on its last legs, you are going to plan your trips and checks on your car differently than if you have a new car. When the beat-up car's engine fails on the highway, you already have a plan for what you'll do. If the new car's engine fails, your entire week just went down the toilet. Not being prepared for a failure in something critical to how you live your life has much more damaging impacts. You're typically only prepared if someone has taught you to be or if you have lived through a similar failure before.
AI Infrastructure
How does all of this relate to AI? A few ways. First, if you look over the market landscape right now, there are a ton of different startups relying on the technology of a select few foundation model providers, aka AI utilities. Most of these startups are simply thin wrappers around this underlying technology. It is really only a matter of time before the likes of OpenAI, Midjourney, Anthropic, and others take the best of what these creative studio startups have built and cut out the middlemen. Each of these foundation model providers knows every call that is being made to their system and every result returned. It's much easier for them to see which use cases are working well and which ones see repeated calls. These AI utilities will have a better understanding of the market's needs than any single tool. Maybe they will make some acquisitions of these companies, mainly for the customers and to increase their growth. So whether these thin-wrapper companies know it or not, they are sitting on a ticking time bomb of infrastructure that is going to leave them vastly unprepared if they are not diversifying. Perhaps they are making a bet that they can diversify before it is too late.
Second, it's pretty clear that AI is changing the way people work, but what happens when an AI system changes, fails, or goes offline? There are multiple reports of AI products increasing the productivity of workers, whether programmers or writers, by 30-80%. As people become accustomed to using these tools, it will change both how they work and their baseline efficiency. That's going to allow companies to potentially trim their workforce. What happens when such a tool disappears or raises its fees? Are workers going to be stuck, or is the product so inelastic that monthly fees will rise to thousands of dollars? Or does a slower pace of progress make your business model no longer sustainable? Alternatively, we’re seeing a lot of variability in the response times for these APIs. Does your application still function at 3-4x longer response times? Can you keep customers under those conditions? Can your team even work if they rely on these tools?
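One way to reason about the response-time question is to put an explicit latency budget around every model call, so a slow provider degrades your product instead of hanging it. Here's a minimal sketch; `call_model` is a stand-in for whatever API client you actually use, and the budget and fallback are illustrative assumptions.

```python
import random
import time

def call_model(prompt: str) -> str:
    # Placeholder for a real provider call; simulates variable latency.
    time.sleep(random.uniform(0.05, 0.2))
    return f"completion for: {prompt}"

def call_with_budget(prompt: str, budget_s: float, fallback: str) -> str:
    """Return the model's answer, or a cached/degraded fallback if the
    call errors out or blows the latency budget."""
    start = time.monotonic()
    try:
        result = call_model(prompt)
    except Exception:
        return fallback
    if time.monotonic() - start > budget_s:
        # Too slow for an interactive flow; serve the degraded answer.
        return fallback
    return result

print(call_with_budget("hello", budget_s=1.0, fallback="[cached answer]"))
```

Running this exercise forces you to decide, ahead of time, what your product does when the answer arrives 3-4x late.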
Third, there are some other nebulous risk issues lurking under the surface. If you are relying on the outputs of foundation models you have no control over, you are subjecting your processes to planned or unplanned bias in those systems. For some use cases, it probably doesn’t matter. For others, you may be inadvertently amplifying political agendas you did not realize the models carried. You should be building output verifications into your systems.
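What might an output verification look like in practice? A minimal version is to never let raw model text into your pipeline: parse it, check it against the shape and values you expect, and reject anything else. The schema and allowed values below are invented for illustration.

```python
import json

# Illustrative constraints; replace with whatever your use case requires.
REQUIRED_KEYS = {"summary", "sentiment"}
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}

def verify_output(raw: str) -> dict:
    """Parse and validate a model response before it enters the pipeline."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned non-JSON output: {exc}") from None
    if not REQUIRED_KEYS.issubset(data):
        raise ValueError(f"missing keys: {REQUIRED_KEYS - data.keys()}")
    if data["sentiment"] not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data['sentiment']!r}")
    return data

checked = verify_output('{"summary": "ok", "sentiment": "neutral"}')
print(checked["sentiment"])
```

The point is that a failed check surfaces as an explicit error you can log and handle, rather than a silent bias propagating downstream.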
Finally, as more AI-generated content permeates the world, expect a bad feedback loop: newly generated content becomes poisoned training data for existing foundation models. The outputs of those models will become noisier and less reliable as the underlying data distributions used to train them shift. To be clear, this is a hard problem to account for. Most non-adaptive machine learning systems built today struggle to handle feedback loops from their interactions with the world.
Impacts and Courses of Action
So if you're currently working in the froth that is the AI markets, what should you do?
Diversify your product with multiple service providers. Don't rely on a single system to provide a core piece of functionality in your product. Besides making your system more robust, it also distributes your data across providers and reduces the insights any one of them can gain.
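The diversification advice above can be sketched as a simple fallback chain: try providers in order and move on when one fails. The provider functions here are invented placeholders, not real client libraries.

```python
def call_provider_a(prompt: str) -> str:
    # Stand-in for a primary provider that happens to be down.
    raise ConnectionError("provider A is unavailable")

def call_provider_b(prompt: str) -> str:
    # Stand-in for a secondary provider.
    return f"B: {prompt}"

PROVIDERS = [call_provider_a, call_provider_b]

def complete(prompt: str) -> str:
    """Try each provider in turn; raise only if all of them fail."""
    errors = []
    for provider in PROVIDERS:
        try:
            return provider(prompt)
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all providers failed: {errors}")

print(complete("hello"))
```

In a real system you'd also want to normalize prompts and outputs across providers, since each model behaves differently, but the structural point is the same: no single dependency should be able to take your core feature down.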
Collect data that others can't. If you have a data advantage, it means it will be harder for an AI utility to undercut you. Additionally, that special data can be monetized in different ways or be used to launch other useful features or products.
Focus on verification. Build systems to verify the inputs you are using and monitor your inputs for purity. You will want to understand if these inputs are changing over time and what impact they will have on your systems. You should also figure out how to deal with feedback loops from your systems’ interactions with the world.
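For the "are my inputs changing over time" part, one lightweight approach is to keep a baseline distribution of some input property and compare today's traffic against it with a divergence measure. This sketch uses KL divergence over made-up language categories; the categories, counts, and alert threshold are all assumptions to tune for your data.

```python
import math
from collections import Counter

def kl_divergence(baseline: Counter, current: Counter,
                  eps: float = 1e-9) -> float:
    """KL(current || baseline) over the union of categories.
    eps smooths categories that appear in only one distribution."""
    cats = set(baseline) | set(current)
    b_total = sum(baseline.values()) or 1
    c_total = sum(current.values()) or 1
    div = 0.0
    for cat in cats:
        p = current[cat] / c_total + eps
        q = baseline[cat] / b_total + eps
        div += p * math.log(p / q)
    return div

baseline = Counter({"en": 900, "es": 80, "fr": 20})
today = Counter({"en": 500, "es": 300, "fr": 200})  # traffic has shifted

drift = kl_divergence(baseline, today)
if drift > 0.1:  # the alert threshold is an assumption
    print(f"input drift detected: KL={drift:.3f}")
```

A near-zero score means today's inputs look like the baseline; a large one is a signal to investigate before the shift quietly degrades your downstream systems.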
Stress test your systems and processes. Go through the exercise of what happens to your business if one of these providers goes out for a day, a week, or indefinitely. If you cut your programming and marketing staff in half because AI tools made them twice as efficient, does your business still work if those tools disappear overnight?
Embrace chaos engineering. Chaos engineering was evangelized by Netflix. The idea is to cause random failures in your infrastructure to make it better and reduce downtime for users. It works really well. How can you do it in a low-risk way? Map out your business as a system and simulate random removal of nodes or increase the time it takes to get a response from a team or vendors. Does your business still function? What do you have to change to make it more robust?
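The low-risk, tabletop version of this exercise can be done in a few lines: model your business as a dependency graph and randomly knock out nodes to see whether the customer-facing function survives. The graph below is a made-up example; substitute your own teams and vendors.

```python
import random

# Hypothetical dependency map: each node lists what it needs to function.
DEPENDENCIES = {
    "customer_app": {"model_api", "payments"},
    "model_api": {"provider_a"},  # a single point of failure
    "payments": {"bank"},
    "provider_a": set(),
    "bank": set(),
}

def is_up(node: str, failed: set, deps: dict) -> bool:
    """A node works only if it hasn't failed and all its deps work."""
    if node in failed:
        return False
    return all(is_up(d, failed, deps) for d in deps[node])

random.seed(7)
for _ in range(5):
    failed = {random.choice(list(DEPENDENCIES))}
    status = "UP" if is_up("customer_app", failed, DEPENDENCIES) else "DOWN"
    print(f"kill {sorted(failed)} -> customer_app is {status}")
```

Even this toy version makes single points of failure obvious: killing the one provider behind `model_api` takes the whole customer app down, which is exactly the kind of fragility the exercise is meant to surface.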
Let me know if you would like for me to take a deeper dive and explore specific risk mitigation examples.