
The AGI Scaffolding Paradox

Published: 27 March 2026

[00:00] Ashley: You know that moment when you download a brand new software platform or maybe you log into a completely unfamiliar CRM for the very first time? Oh, yeah. The one where there's no tutorial. Exactly. There is no instruction manual. Nobody is standing over your shoulder telling you where to click. You sort of poke around for a couple of minutes. Right. You click a few buttons. You figure out the underlying logic. Like, OK, this tab does this. That button does that. And suddenly, you know, you're working. You adapt to it.
[00:29] Ray: Yeah, I mean, you are inferring the goal just by interacting with the environment. You don't need a thousand examples of someone else using that specific CRM to know how a search bar works.
[00:40] Ashley: Right. But here is the massive reality check that just hit the tech world. It turns out that artificial intelligence cannot do that. At all. Not even a little bit. The world's most advanced frontier AI models just took this completely new kind of intelligence test and every single
[00:58] Ray: one of them completely failed. It wasn't just a minor stumble either. It was a categorical failure at a task that everyday humans find completely, you know, intuitive.
[01:08] Ashley: Hi, I'm Ashley.
[01:10] Ray: And I am Ray.
[01:12] Ashley: Welcome to Podcast 7. Today, we are doing a deep dive into what researchers are calling the scaffolding paradox. It's a fascinating concept. It really is. We're going to look at why this new AI failure proves that true fluid intelligence simply doesn't exist yet in our frontier models.
[01:29] Ray: And more importantly for you listening, we're going to break down exactly what this means in the world of B2B demand generation and GTM engineering.
[01:37] Ashley: Because... If the AI isn't actually autonomous, the real value isn't the model itself. It's the human engineered workflows and the scaffolding that you build around it. Absolutely. So to understand
[01:49] Ray: why the models failed so spectacularly, we have to look at what this new benchmark is actually demanding that previous tests just missed completely. Right. Because historically,
[01:58] Ashley: models passed these tests with flying colors.
[02:00] Ray: Exactly. For years, we've been testing models on static knowledge. We give them trivia. We ask them to pass the bar exam or we ask them to. write Python code.
[02:09] Ashley: Which they are incredibly good at. I mean, they ace the bar exam. So what makes this new test so fundamentally different?
[02:15] Ray: Well, it's called RRC-AGI3. Yeah. It was launched by the ARC Prize Foundation, which is backed by Y Combinator's winter 2026 batch. Okay. And it was founded by Mike Noop and Francois Chalet with Greg Kamrad as president. And unlike those static tests, ARC-AGI3 is an interactive reasoning benchmark.
[02:33] Ashley: Meaning it tests fluid intelligence. Right.
[02:37] Ray: Instead of a text prompt, the test drops the AI agent into a novel turn-based environment. Visually, it's just a 64 by 64 grid with 16 possible colors. I want to pause on
[02:47] Ashley: that for a second because that sounds deceptively simple. A 64 by 64 grid. We are talking about something that looks like an old school Atari game,
[02:55] Ray: right? Very similar, yeah. But the catch is that there is zero language provided to the AI. Wow. Nothing. No cultural symbols, no text instructions, nothing telling the AI what the actual win condition is. environment relies entirely on what researchers call core knowledge priors.
[03:10] Ashley: And that means like basic intuitive physics. Exactly.
[03:13] Ray: Gravity, momentum, basic geometry, topology, and object. Object. Yeah, the idea that a block of pixels is a solid thing that can move or... you know, be blocked by another solid thing. So it's basically a silent
[03:23] Ashley: video game. The AI is dropped in and just has to
[03:26] Ray: figure it out. Right. The agent has to autonomously explore the grid, figure out what the goal even is, build an internal mental model, and then execute a plan to win. It has to learn from raw experience on the fly. And this
[03:39] Ashley: brings us to the shocking results. The ARC Prize Foundation ran human testers through these 135 novel environments. To normal people, right? Yeah, just everyday people off the street. San Francisco sitting down for 90-minute sessions, 100% of the human testers solved the environments on their first try. Every single one of them. Every single one. But then they ran the absolute best frontier AI models on the exact same
[04:02] Ray: environments. And the numbers are staggering. Gemini 3.1 Pro Preview scored Banner 0.37%. Ouch. GPT 5.4 scored 0.26%. Claude Opus 4.6 got 0.25%. And Grok 4.2 scored a flat 0%.
[04:20] Ashley: Less than 1% across the board. I mean, to ground this in a real world B2B analogy for everyone listening, think about the current large language models like highly educated interns. Oh, that's a good way to look at it. Right. These interns have literally memorized every textbook, every sales methodology, every bit of code ever written. But the absolute second you drop them into a brand new custom built CRM with no SOP, they freeze.
[04:48] Ray: Because a human intern. figures out the new interface by poking around. Yeah. The AI just halts. It doesn't know what to do with that explicit instructions. Exactly. And, well, the core mechanism behind that failure is how LLMs are actually built. They are next token predictors. Right.
[05:04] Ashley: They just predict the most likely next word based on billions of human language patterns. Exactly. They don't
[05:08] Ray: actually see shapes or understand gravity. If there isn't a text-based workaround to explain the spatial relationship of a grid, their architecture just completely collapses.
[05:18] Ashley: They can't reason visually the way a human brain intuitively does. But, you know, when I hear scores that low, literally 0% or fractions of a percent, my first instinct isn't that the AI is dumb. What is it then? My instinct is that the grading rubric is fundamentally broken. Are they grading this fairly? Or is the test just rigged to make AI look bad?
[05:41] Ray: That is the exact debate sweeping the industry right now, especially highlighted recently in the Neuron newsletter. Okay, what are they saying? So the ARCAGI-3 test uses a metric called Relative Human Action Efficiency, or RHAE. This metric doesn't just measure whether the AI eventually solved the puzzle. It measures how many actions the AI took to solve it compared to a human baseline. Okay. And they define that baseline as the second best human tester, dropping the absolute best score to remove outlier
[06:10] Ashley: speedrunners. Okay. Well, that sounds reasonable so far. You want to see how efficient the AI is compared to a highly proficient human?
[06:16] Ray: Right. But the controversy comes from how they calculate the penalty for inefficiency. It's a squared efficiency penalty. A squared penalty. Yeah. So let's say the human baseline completes a level in 10 actions. If the AI agent flails around trying to explore the environment and takes 100 actions to solve it, the AI does not get 10% credit.
[06:37] Ashley: Because the penalty is squared. Exactly. Because it's
[06:40] Ray: squared, it gets 1% credit. Wait, hold on.
[06:42] Ashley: I'm not a mathematician, so walk me through this. If a human takes 10 steps... and the AI takes the 100, so it takes 10 times as long, why drop the score all the way to 1% instead of just giving it a proportional 10%? I know, it sounds harsh. That feels incredibly punishing.
[06:58] Ray: It is punishing, and that's the whole point. The creators are trying to penalize brute force guessing. If an AI just randomly clicks every possible combination until it stumbles onto the right answer, it isn't displaying intelligence. It's just using massive
[07:13] Ashley: compute power to guess. Right, that makes sense.
[07:15] Ray: So the squared penalty destroys the score of any model that relies on brute force instead of actual reasoning. And to make it even stricker, the AI scores are capped at 1.0x, the human baseline.
[07:28] Ashley: Meaning if an AI magically finds a brilliant shortcut and solves the level in two actions instead of 10, it doesn't get 500% credit. No, it just gets 100%. The
[07:37] Ray: system gives no extra credit for shortcuts because the goal is to measure generalizability across all games, not exploiting a single level. Yeah.
[07:46] Ashley: Okay, I can see why critics are up in arms. In the Neuron, users like Scaling0.1 argue that this methodology is intentionally designed to
[07:54] Ray: force low scores. Yeah, they claim a squared penalty is too harsh and that capping the score prevents AI from showing superhuman efficiency.
[08:01] Ashley: But Francois Chalet fired back on that. His point is that if we are ever going to achieve true AGI artificial general intelligence, it shouldn't require massive human handholding or pre-existing scaffolding to figure out an environment. You know, I actually
[08:14] Ray: have to push back on the critics complaining about this scoring too, especially when we look at this through a go-to-market lens.
[08:20] Ashley: Oh, really? Break that down for me. How does a visual grid puzzle relate to an outbound campaign?
[08:27] Ray: Well, think about action efficiency and lead generation. Extreme inefficiency isn't just a little bad. it is catastrophic. That is very true. If a human SDR needs to send 10 highly targeted, well-researched emails to book one meeting, but your AI agent needs to blindly send 100 poorly calibrated emails to book that same meeting. The AI isn't
[08:50] Ashley: just less efficient. Exactly.
[08:52] Ray: That inefficiency burns through your limited budget. It triggers email spam filters. It ruins your domain
[08:58] Ashley: reputation instantly. Oh, man. Yeah, instantly.
[09:01] Ray: In the real world of B2B sales, taking 10 times as many actions to achieve the same result is a complete failure of the system. So the penalty isn't just justified. It's a perfect reflection of business reality. That is a great point.
[09:12] Ashley: The cost of an action in the real world is almost never free. Every API call, every email sent, every database query has a cost, a risk, or a latency attached to it. Exactly. Which brings us to a huge pivot point. If the models themselves are fundamentally this inefficient in novel environments, why are we seeing any success at all with AI and B2B?
[09:33] Ray: Well, it completely changes how professionals need to view their tech stacks today. It introduces us to what researchers call the scaffolding paradox.
[09:42] Ashley: The scaffolding paradox. Yes. The magic
[09:44] Ray: isn't actually the AI model. It's the harness holding the model up.
[09:48] Ashley: Let's get into these harnesses. During the ARC AGI testing and the preview competitions, researchers realized that the AI models only performed well if humans built elaborate, domain-specific scaffolding around them.
[10:00] Ray: Yeah, the academic partnerships reveal this clearly. For example, a team from Duke University built a specialized harness just to help the AI manage its own
[10:07] Ashley: interaction history. Because the environment frames are 64 by 64 grids, an AI model quickly exhausts its context window if you just feed it a running history of everything it has seen. Right. Give me a visual for why that happens. Why does a grid exhaust a context window so fast?
[10:22] Ray: Okay, imagine trying to play chess. But instead of just looking at the board and intuitively seeing the pieces, someone is reading you a list of thousands of numerical coordinates every single time you want to make a move. Oh, that sounds awful. Right. Pawn is at e4, knight is at f3. Empty space at G2 over and over again. You'd lose your mind. Yeah, absolutely. That's what happens to an LLM. Gets overwhelmed by the
[10:46] Ashley: raw data. So what did the Duke team actually do
[10:49] Ray: to fix that? The human researchers had to build a Python-based system that allowed the AI to selectively retrieve its own past actions. Wow. Another group built an orchestrator sub-agent architecture where a top-level manager delegates tasks to sub-agents just to keep the AI from getting confused. That's crazy. The AI wasn't... Smart enough to remember its own history efficiently, so a human had to code a memory retrieval system for it. And when we
[11:14] Ashley: directly connect this to GTM engineering, it perfectly explains the current landscape of sales tech. Think about the tools you are probably using right now. Clay. Smart Lid. Instantly, Lemlist. The actual intelligence driving a modern outbound campaign in those platforms isn't the underlying
[11:33] Ray: LLM. No, not at all. The LLM is just a highly capable text processor sitting in the middle of a much larger workflow.
[11:41] Ashley: The real intelligence is the GTM engineer. It's the human who is building the intricate logic gates, the complex data enrichment waterfalls, the signal detection rules. Exactly.
[11:52] Ray: The human decides exactly which intent data to scrape, how to structure the prompt, and what the final output should look like.
[11:58] Ashley: The AI is just a gear in a machine built entirely by a human. [SPONSOR] And by the way, if you are navigating this exact challenge and trying to build that kind of advanced architecture for your own pipeline, you should look into GTM7. It's where GTM engineering meets AI execution. You can check them out at gtm7.ai. And you can also check out Demand7.ai, which is where AI meets demand generation. Two great resources. And,
[12:21] Ray: you know, this reality of human-built scaffolding completely aligns with a very provocative argument made by NYU professor Sainichi. Also featured in The Neuron, right? Yes. He argued that LLMs are actually anti-bitter lesson.
[12:33] Ashley: I want to make sure everyone follows that because the bitter lesson is a foundational concept in AI development. It was originally coined by researcher Rich Sutton. The Bitter Lesson states that raw compute power and learning from raw experience will always eventually beat human-encoded knowledge. Essentially, don't try to teach an AI how to play chess by giving it human strategies. Just give it the rules, give it massive computing power, and let it play itself a billion times until it learns strategies humans haven't even thought of. That is the traditional view. Okay.
[13:05] Ray: But she points out that modern LLMs are the exact opposite of that. How so? They are entirely built on human-generated knowledge. They were trained on the entire internet of human text, human logic, human reasoning, human conversations. Right. They don't learn from raw interactive experience in the physical or digital world. They just regurgitate our own cleverness back to us. So the moment you put them
[13:28] Ashley: in an environment where they can't rely on pre-existing human instructions, like that silent visual ARC, AGI 3 grid collapse.
[13:37] Ray: Exactly. They have no human text at all from. And they don't know how to learn from scratch.
[13:41] Ashley: This is why every SaaS company on the planet pivoting to sell you autonomous AI agents is a bit misleading. Just a bit. The ARC benchmark mathematically proves they aren't autonomous. They are traditional software architectures with an LLM taped to the middle. The human operator is still doing the real driving.
[14:01] Ray: Which leaves us with a very practical problem. Since True, Autonomous AGI isn't walking through the door anytime soon to save your Q4 pipeline, how do marketers and RevOps leaders actually operate right now?
[14:13] Ashley: Yeah, because we can't just plug an LLM into our CRM, say, go find leads and walk away. No, you have to play in
[14:19] Ray: what the researchers call the residual gap. The residual gap. Define that for me. It's the measurable space between human learning efficiency and current AI capabilities. As long as there is a gap between how fast a human can adapt to a new environment and how fast an AI can, we do not have AGI.
[14:35] Ashley: And right now, as those 0% test scores show, that gap is massive. Huge.
[14:40] Ray: This means B2B practitioners need to actively take on the heavy lifting that the AI simply cannot do. Right.
[14:48] Ashley: If you look at the four core capabilities tested in the ARC benchmark, they are exploration, modeling, goal setting, and planning.
[14:55] Ray: And your AI tools cannot do the goal setting or the modeling for you.
[15:00] Ashley: Because the AI cannot autonomously infer the mechanics of an unknown environment. So you, the human operator, must rigidly define
[15:09] Ray: everything. Right. You have to define exactly how to turn a raw intent signal, like a prospect visiting your pricing page or a target company raising a series B into an actionable outcome. You have to build the exact
[15:23] Ashley: sequence of events. You are essentially holding the AI's hand while it is still learning to walk.
[15:28] Ray: It's a fun. shift in expectations. I talked to a lot of RevOps teams who are incredibly frustrated because they expect the AI to behave in a way it simply isn't built for. They
[15:36] Ashley: think it's broken, but it's actually just unguided. Exactly.
[15:40] Ray: I actually have an analogy to offer you on that front. Lay it on me. Ask yourself. Are you treating your AI like a seasoned chief revenue officer who can figure out a new market on the fly? Or are you treating it like a super fast data entry clerk who needs explicit instructions for every single click? Oh, that's really good.
[15:58] Ashley: Treating it like the CRO is exactly why most AI sales plays fail right now. Right. The CRO has fluid intelligence. The CRO looks at a messy CRM, infers the company's goals without being told, builds a mental model of the pipeline. And executes a strategy. But
[16:14] Ray: the AI is the data entry clerk. It can process a million rows of data in three seconds, which a human could never do. Right. But if you don't tell it exactly what column to put the output in, it breaks. His incredible reasoning capabilities within verifiable domains where it has sufficient knowledge coverage, but it is entirely bound by the
[16:32] Ashley: scaffolding we provide. The human ability to rapidly adapt to understand the unknown unknowns of a shifting market is still entirely unique to us. And understanding that difference is your competitive advantage right now. The technical failure of these frontier models isn't just an abstract research problem for academics to debate. No, it directly
[16:50] Ray: dictates where you should be spending your time and budget today.
[16:53] Ashley: Stop waiting for a magical autonomous agent to solve your pipeline issues. Start investing heavily in the human engineering required to build the perfect harness for these incredibly powerful but ultimately dependent
[17:05] Ray: models. It is the human in the loop. that creates the actual value. But if we connect this to the bigger picture, it raises an incredibly important question for the future of our industry. Okay, what is it? If frontier models are essentially anti-bitter, lesson meaning they are built on human-generated knowledge rather than learning from raw experience, is the real moat in lead gen still the AI you use? Or is it the unique way you've learned to handhold an agent that still can't navigate an unknown environment on its own?
[17:38] Ashley: Wow. That completely flips the script on how we view competitive advantage in SaaS right now. It's not about the model. It's about the harness. We would love to hear how you are building your own scaffolding and whether you think those workflows are your new proprietary IP. Come share your thoughts and continue the conversation with us over at podcast7.ai. Thanks for joining us for this deep dive. See you next time.