An effective goal-setting system starts with disciplined thinking at the top, with leaders who invest the time and energy to choose what counts. — John Doerr
Note: This comes out of a deep discussion with Rob May about the future.
As organizations begin deploying AI agents (customer-facing assistants, research copilots, code generators, process managers, and embedded automations), they need a better metric to manage these applications: not only to know whether the systems are working, but to know how well, at what cost, and compared to what. Classic metrics break down when a system's performance and cost depend on how much time it spends thinking.
What’s needed is a clearer lens: a way to see not only whether an outcome happened, but also its quality and what the system required to produce it.
Intelligence per watt is a recasting of a classic ratio - output per cost - modernized for the current environment. It replaces vague notions of productivity with a richer understanding of what intelligence looks like in practice, and it swaps out financial abstractions for a physical truth - energy draw. In a world of increasingly autonomous systems, this metric offers more than comparison; it offers control. Let's break it down and see how we get there.
From Output per Cost to Intelligence per Watt
This isn’t a brand-new idea. It’s an evolution. You don’t simply want to know what got done. You want to know how much intelligence it took, and how much energy it burned to get there.
It moves from the transactional to the functional. “Output” is replaced by intelligence, a more useful, nuanced way to measure how well a system performs a task. “Cost” is grounded in watts, the clearest, most universal unit of energy consumption, uncorrupted by pricing models or cloud invoices.
This metric becomes especially important when comparing agents with different architectures, strategies, and tradeoffs. One might take 10 careful steps and deliver 95% accuracy. Another might take 4 scrappy ones and land at 88%, but use a fraction of the energy. Which is better? The answer depends on context, and this metric gives you a way to reason through that choice.
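To make that concrete with assumed energy figures: if the careful agent burns 100 watt-hours per task and the scrappy one burns 20, you get 0.95 / 100 ≈ 0.0095 accuracy per watt-hour versus 0.88 / 20 = 0.044. On raw efficiency the scrappy agent wins by more than 4x; whether that is the right trade depends on what the seven-point accuracy gap costs you.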
Measuring Intelligence
There are a lot of leaderboards that attempt to measure the intelligence and performance of different models. However, these are 'general' measures of a model, and a 'general' benchmark doesn't correlate directly with a model's ability to solve your business problem. Our notion of intelligence is not a model’s ability to ace a logic puzzle or imitate human language beautifully. We mean task-relevant intelligence: how well a system accomplishes a goal within a specific environment.
You can’t abstract intelligence from its context. A large language model might write compelling text but fail at real-time decision-making. A rule-based heuristic might outperform GPT-4 in a narrow logistics workflow. No amount of logical prowess helps in emotional contexts; John Nash and Sheldon Cooper show us where raw intelligence can fail. Context is not a wrinkle; it’s the foundation. That’s why our metric uses a definition of intelligence that accounts for the environment:
Intelligence = Task Complexity × (Outcome Quality / (Error Rate × Error Cost)) × Throughput

Where:
Task Complexity (coefficient factor) - how complex is the task? What level of analogous intelligence (a 5-year-old, a PhD, a crow, etc.) is required to solve it?
Outcome Quality (dollars or time or energy) - how effective is the output that the system provides? This can also include a measure of how many iterations are required. This should be in the same units as Error Cost.
Error Rate (percentage) - what is the rate of error or mistake?
Error Cost (dollars or time or energy) - when the system makes an error or mistake, what does it cost? This should be in the same units as Outcome Quality.
Throughput (items/hour) - how many tasks/items/objectives are processed in a given unit of time?
Why this measure? We're trying to determine how quickly the system can provide high-quality output with minimal error. A higher-performing system can handle more complex tasks at a higher level of quality. At the same time, when examining automated systems, understanding the cost of an error is paramount: error cost controls the usability and structure of the system, and it gets accentuated by the rate of error. We combine these four aspects into the equation above and multiply by throughput to see how the system scales. This gives us an overall measure of intelligence, one tied directly to the task at hand.
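To make the equation concrete, here is a minimal sketch in Python. The function names and the quality-over-expected-error-cost structure are one plausible reading of the definitions above (chosen so the units work out, as discussed later in the piece), not a canonical implementation:

```python
def intelligence_score(task_complexity: float,
                       outcome_quality: float,
                       error_rate: float,
                       error_cost: float,
                       throughput: float) -> float:
    """Task-relevant intelligence, per the equation above.

    task_complexity: unitless coefficient (e.g. 1.0 = simple, 5.0 = PhD-level)
    outcome_quality: value of a good outcome (dollars, time, or energy)
    error_rate:      fraction of outputs that are mistakes (0.0 to 1.0)
    error_cost:      cost of a mistake, same units as outcome_quality
    throughput:      items processed per hour
    """
    if error_rate <= 0.0:
        raise ValueError("use a small floor (e.g. 0.001) rather than zero errors")
    # Quality divided by expected error cost: the dollar units cancel, leaving
    # a unitless quality term scaled by complexity and throughput (items/hour).
    return task_complexity * (outcome_quality / (error_rate * error_cost)) * throughput


def intelligence_per_watt(score: float, watts: float) -> float:
    """Divide by average power draw; the units resolve to items per watt-hour."""
    return score / watts
```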
A chatbot that handles 100 inquiries an hour with 10% error might outperform one that handles 20 inquiries at 2% error, especially if the mistakes are low-cost (misspelling a name) rather than catastrophic (giving medical advice). Throughput, error cost, and environment must all be factored in. Otherwise, you’re optimizing for the wrong thing.
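Plugging the chatbot example into that sketch, with hypothetical dollar and wattage figures:

```python
# Hypothetical figures: each resolved inquiry is worth $5, and a low-cost
# mistake (a misspelled name) costs $1 to correct.
fast    = intelligence_score(1.0, 5.0, 0.10, 1.0, 100)  # 100/hr at 10% error
careful = intelligence_score(1.0, 5.0, 0.02, 1.0, 20)   #  20/hr at  2% error
print(fast, careful)  # 5000.0 5000.0 -- these particular numbers tie exactly

# Raise the stakes: a catastrophic error (bad medical advice) at $5,000
# drags the score down a thousandfold, a signal neither bot belongs there.
print(intelligence_score(1.0, 5.0, 0.10, 5000.0, 100))  # 1.0

# The tie between the two bots breaks on energy. If the fast bot is a small
# model drawing ~50 W and the careful one a large model drawing ~500 W:
print(intelligence_per_watt(fast, 50.0))     # 100.0
print(intelligence_per_watt(careful, 500.0)) # 10.0
```

Note that the formula is linear, so a 5x throughput advantage and a 5x error-rate disadvantage cancel exactly; in practice the wattage, and your tolerance for catastrophic tails, break the tie.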
The process matters too. One agent might take 4 steps to solve a task. Another might take 10. The 10-step agent might seem “more intelligent,” but if it consumes 5x the energy and takes 3 times as long, only to deliver marginal improvement, its intelligence per watt is lower. Fast, lean reasoning, even if imperfect, often wins in practice.
Why Watts?
Why use watts as a measure instead of something else? Watts expose the core of consumption that other measures, such as dollars, can obscure. For a given system architecture, a given output will always cost the same number of watts, but the dollar cost of that energy may change. Whether it’s a brain or a model, every act of intelligence, every decision, consumes energy. Watts show you the metabolic truth of cognition. Watts also have the benefit of measuring a flow, which aligns with another flow we are interested in - throughput.
Watts are about limits. They define where intelligence can live. The human brain runs at roughly 20 watts. A GPU runs anywhere between 400-1500 watts. An Arduino runs at about 1 watt. Your smartphone runs at 0.5-5 watts. A system that draws 15,000 watts to think clearly might work fine in a datacenter. But it can’t run on a drone, or a wrist, or a remote sensor in a jungle. It can’t scale to the edge. It can’t operate off-grid. Watts highlight constraints you can’t ignore. They tell you whether a system is deployable, survivable, or adaptable. And those attributes matter more as intelligence spreads into the physical world.
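Those budgets can be made operational. A minimal sketch using the figures above; the deployment-target labels and exact thresholds are illustrative assumptions, not measurements:

```python
# Rough power budgets in watts, taken from the figures above.
POWER_BUDGETS_W = {
    "datacenter rack": 15_000,
    "single gpu": 1_500,
    "human brain": 20,
    "smartphone": 5,
    "arduino-class sensor": 1,
}

def deployable(system_draw_w: float) -> list[str]:
    """Return the targets whose power budget can absorb the system's draw."""
    return [t for t, budget in POWER_BUDGETS_W.items() if system_draw_w <= budget]

print(deployable(12.0))   # reaches down to the brain's 20 W envelope
print(deployable(400.0))  # datacenter and GPU only - it can't scale to the edge
```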
So, when we say intelligence per watt, we’re not just proposing a performance metric. We’re naming a relationship between a task and reality. Between what a system can do, and what it takes from the world to do it. Watts give us a way to see cognition more truthfully. They help us manage not only what our systems produce, but also how they exist, and whether that existence is sustainable, portable, or even justifiable at scale. In the end, intelligence isn’t just about being right. It’s about being right within constraints, within your energetic means. Watts help us see whether the systems we’re building are merely smart, or actually fit to live.
But What If You Can’t Measure Watts?
If we break down intelligence per watt into its raw units, we get:

Intelligence per Watt = [coefficient × (dollars / (percentage × dollars)) × (items/hour)] / watts
The coefficient and percentage are unitless and the dollars cancel out. This gives items per watt-hour. Essentially, intelligence per watt measures how many items you get for a given level of consumption, since you purchase power in watt-hours. That's the same as when you buy cloud computing, except a markup for the cost of compute is applied to the variable cost of the electricity; ultimately, you're getting a certain amount of compute per unit cost of electricity.
Here’s the rub: most organizations don’t get wattage readouts; they get API pricing or cloud compute charges. So how do you use this metric? This is where assessing changes becomes your best signal. When the measurement is noisy, focus on the delta. You may not know the absolute number of watts a system consumes, but if you track the change in intelligence per dollar, per latency unit, or per token over time, you start to see patterns. If your outcomes are improving without increases in cost or latency, your intelligence per watt is likely improving. If a new system does the same job faster, cheaper, or at lower resource cost, it’s almost certainly more efficient. Alternatively, if it's hard to get directly to watts, create a benchmark task and use it as a normalizing baseline for new systems. Ideally, though, you can reverse-engineer intelligence per watt from costs, estimates of how quickly requests are fulfilled, and the number of machines it took to fulfill them.
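Here is a minimal sketch of that delta-tracking approach, reusing the intelligence_score idea from the earlier sketch; the monthly figures are hypothetical:

```python
# When watts aren't observable, track the change in outcomes per dollar.
# "score" is the task-relevant intelligence value from the earlier sketch;
# the spend figures below are hypothetical monthly samples.
history = [
    {"month": "Jan", "score": 4200.0, "cloud_spend_usd": 1000.0},
    {"month": "Feb", "score": 4600.0, "cloud_spend_usd": 1000.0},
    {"month": "Mar", "score": 4800.0, "cloud_spend_usd": 900.0},
]

prev = None
for row in history:
    proxy = row["score"] / row["cloud_spend_usd"]  # intelligence per dollar
    delta = "" if prev is None else f"{(proxy - prev) / prev:+.1%}"
    print(row["month"], round(proxy, 2), delta)  # Jan 4.2, Feb 4.6 +9.5%, Mar 5.33 +15.9%
    prev = proxy
```

For a fixed architecture and a stable electricity price, dollars and watts move together, so a rising intelligence-per-dollar trend is a reasonable stand-in for rising intelligence per watt.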
Managing with Intelligence per Watt
Management is where the metric becomes real. What does it let you do differently? Here’s a framework for using intelligence per watt as an operational lens:
Create AI System Profiles - Track task complexity, outcome quality, error type/cost, throughput, and latency across agents. Use architecture-level estimates to approximate wattage. You don’t need perfect numbers, just enough signal to track performance over time (see the sketch after this list).
Benchmark Architectures, Not Just Outcomes - Run the same task across different agent types, LLMs, heuristics, embedded models. Compare success-per-cost and improvement-per-watt. You’ll often find smaller, simpler agents outperform their heavyweight peers on domain-specific tasks.
Loops and Robustness - Don’t solely measure peak performance; measure how efficiently the system loops. How often is it right on the first try? What does it take to recover from a mistake? Favor systems that improve with each cycle, not those that impress once.
Remove Energy Hogs - Look for agents that deliver only marginally better results at significantly higher energy or dollar cost. Sunset them early, or refactor to isolate the intelligent core.
Plan for the Long Term - Use intelligence per watt as a long-term KPI. Optimize architectures that deliver more with less. Push intelligence to the edge. Build systems that survive real world constraints.
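A minimal sketch of the profile-and-benchmark loop described above; the agents, numbers, and wattage estimates are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    """One row in an AI system profile; field names are illustrative."""
    name: str
    task_complexity: float  # unitless coefficient
    outcome_quality: float  # dollars per good outcome
    error_rate: float       # fraction of outputs that are mistakes
    error_cost: float       # dollars per mistake
    throughput: float       # items per hour
    est_watts: float        # architecture-level estimate, not a meter reading

    def per_watt(self) -> float:
        """Intelligence per watt, using the equation from earlier."""
        score = (self.task_complexity
                 * (self.outcome_quality / (self.error_rate * self.error_cost))
                 * self.throughput)
        return score / self.est_watts

# Same task, two architectures: a large LLM agent vs. a domain heuristic.
profiles = [
    AgentProfile("llm-agent", 2.0, 5.0, 0.05, 10.0, 40.0, 700.0),
    AgentProfile("heuristic", 1.0, 5.0, 0.08, 10.0, 90.0, 30.0),
]
for p in sorted(profiles, key=AgentProfile.per_watt, reverse=True):
    print(p.name, round(p.per_watt(), 2))  # heuristic 18.75, llm-agent 1.14
```

Here the heavyweight agent scores higher on raw intelligence (800 vs. 562.5) but the simple heuristic wins per watt, which is exactly the pattern the benchmarking step is meant to surface.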
Designing for Fit, Not Force
The deeper question here isn’t what’s performant, but what’s appropriate. The most powerful intelligence may not be the smartest or biggest. It may be the one that fits best. Intelligence isn’t defined by size; it’s defined by what survives in a given context. Nature is already doing this. Bees, bacteria, and plants all “compute” in ways we still barely understand, yet thrive with elegance.
So, the question isn’t can we build smarter systems? The question is can we build systems that think at the right scale, with the right rhythm, at the right cost to the world? Because in the end, intelligence per watt isn’t merely a metric. It's a question - what kind of intelligence do you want to scale?