
Beyond Intent Signals: Simulating the Brutal Truth

Published: 15 June 2026

[00:00]

Ashley: Hi, I'm Ashley.

Ray: And I am Ray.

Ashley: Welcome to Podcast 7. Today we're getting into a breakthrough that

honestly sounds like something straight out of a corporate sci-fi thriller, but

it has immediate real-world implications for how you build your go-to-market

strategy.

Ray: Oh, absolutely. So here's the hook.

Ashley: A major toothpaste company, of all things, published a research paper

that could quietly and permanently kill the entire multibillion-dollar market

research industry. I mean, it's quite the headline, right?

Ray: We're talking about a joint research paper from PyMC Labs and the

Colgate-Palmolive company. Right. And to really understand why a toothpaste

manufacturer's research matters to you, especially if you're a B2B marketer, a

product manager, or a GTM engineer, we have to talk about this massive flaw.

Yeah.

Ashley: In the B2B sales world, we literally just call it the polite pass. I

mean, it is the absolute bane of any go-to-market team's existence. You've

probably lived this, right? You get a prospect on a demo, you spend an hour

walking them through your enterprise SaaS platform, and they're nodding the

whole time.

Ray: Oh, totally. They say the user interface is intuitive. They compliment your

analytics.

Ashley: Exactly. They give you a four out of five on your post-demo survey, and

then they completely ghost your sales team.

Ray: Right, because when you survey humans, they're often too polite or rushed

or honestly just too apathetic to give you the brutally honest truth to your

face. They smile. They nod and they walk away.

Ashley: Which creates this terrible feedback loop. Companies are spending

billions globally on focus groups and surveys, but the data is just incredibly

noisy. It's totally inflated by politeness.

Ray: Exactly. So what this PyMC Labs and Colgate research explores is how we can

use AI to completely bypass that flawed human loop. They're engineering

synthetic consumers that are, well, entirely immune to being polite. They do not

care about your feelings. They will give you the unvarnished truth.

[02:05]

Ashley: But to appreciate how revolutionary this fix is, we kind of have

to understand why early attempts at using AI for market research basically

face-planted. Because researchers have been trying to use large language models

for this for a while now.

Ray: Oh, yeah. Naturally, when models like GPT-4 came along, people thought,

great. We don't need human focus groups anymore. We'll just feed the AI our

product concept and ask it to rate it on a one to five scale.

Ashley: It makes logical sense on the surface, right? You have this vastly

intelligent model. It read the whole Internet. So just ask for a number.

Ray: Yeah, it's called a direct Likert rating. But the data in this paper shows

exactly what happens when you do that. The AI produces these incredibly

unrealistic, overly narrow distributions. It completely fails to match the human

tendency to, you know, hand out strong opinions. It just plays it incredibly

safe.

Ashley: Exactly.

Ray: Instead of definitive fours or fives or brutal ones, the models heavily

cluster their answers right around a neutral three.

Ashley: I love this detail because it completely shatters the illusion of the

all-knowing, confident AI. When you directly prompt an LLM for a rating, it

doesn't act like a decisive buyer. It acts exactly like a terrified corporate

intern.

Ray: Yes.

Ashley: You know, an intern who is desperately afraid of having a real opinion,

so they just circle average on every single survey to avoid getting yelled at by

the boss.

Ray: It's actually worse than an intern, really. Because an intern is driven by

fear, whereas the AI is mathematically designed to optimize for the safest

statistical probability.

Ashley: Right, because it's just predicting the next token.

Ray: Exactly. When confronted with a direct numerical choice without context,

picking the median, a three out of five, minimizes the statistical penalty. It

regresses to the mean.

Ashley: So the average might look OK, but the actual distribution is completely

wrong.

Ray: Yeah. Real humans have messy, varied opinions. The directly prompted AI is

just mathematically bland.

[04:00]

Ashley: OK, I'm stuck here. If asking an AI for a simple number gives us

middle-of-the-road, mathematically bland garbage, how do we actually get the

truth? I mean, a paragraph of raw text from an AI role-playing as a buyer is

nice for a brainstorm. Sure. But it seems completely useless to a data

scientist. You can't put a rambling paragraph into a predictive spreadsheet. How

did the researchers fix the math?

Ray: That is the core innovation of this paper. They realized that to get the

truth, you have to let the AI do what it does best: generate language. You ask

it to talk instead of asking it to count. They call it semantic similarity rating,

or SSR. Instead of forcing the LLM to pick a number, they give the model a

specific demographic profile, show it the product, and just ask the AI to write

down its raw, unfiltered thoughts in free text.

Ashley: But to solve the scaling problem, how to turn text into data, they rely

on embeddings, right? Let's unpack embeddings for the GTM engineers listening

who might be dealing with semantic search.

Ray: Sure. They take that free text response and run it through OpenAI's

text-embedding-3-small model. An embedding model essentially translates concepts

and sentences into a coordinate in a massive multidimensional mathematical

space.

Ashley: So if we picture this as a physical space, like a giant 3D room, if the

AI writes a glowing review, it gets plotted in the top right corner. If it

writes a terrible review, it goes to the bottom left.

Ray: That's a really great way to visualize it. The review literally becomes a

specific point on a map.

Ashley: Exactly.

Ray: And once you have that point on the map, the researchers calculate

something called cosine similarity between that point and a set of predefined

anchor statements.

Ashley: And these anchor statements are like the reference points on the map

that the researchers placed there beforehand.

Ray: Yes. You place an anchor statement for a one rating, which might be the

phrase, "I would definitely not buy this." And an anchor for a five rating like,

"I am definitely buying this." By measuring the cosine similarity between the

AI's raw thoughts and each of those anchors, they project the text onto a

five-point scale. It turns qualitative rambling into a numerical probability

distribution over the five ratings.
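
A minimal sketch of the mapping Ray describes, in Python. It assumes the response and the five anchor statements have already been embedded; the 3-dimensional vectors below are toy stand-ins for real embedding output, and the anchors for ratings 2 to 4 are hypothetical. The normalization step (subtract the lowest similarity, then rescale so the values sum to 1) is also an assumption, since the episode only says the similarities are projected onto a five-point scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ssr_distribution(response_vec, anchor_vecs, epsilon=1e-9):
    """Turn one embedded response into a probability distribution over ratings.

    Assumed normalization: shift similarities so the weakest anchor sits near
    zero, then rescale so the five values sum to 1.
    """
    sims = [cosine_similarity(response_vec, anchor) for anchor in anchor_vecs]
    floor = min(sims)
    shifted = [s - floor + epsilon for s in sims]
    total = sum(shifted)
    return [s / total for s in shifted]

# Toy "embeddings" for the five anchors, ordered from rating 1 to rating 5.
anchors = [
    [-1.0, 0.1, 0.0],  # 1: "I would definitely not buy this."
    [-0.5, 0.5, 0.0],  # 2: hypothetical anchor
    [0.0, 1.0, 0.0],   # 3: hypothetical anchor
    [0.5, 0.5, 0.0],   # 4: hypothetical anchor
    [1.0, 0.1, 0.0],   # 5: "I am definitely buying this."
]

# Toy embedding of one enthusiastic free-text response.
response = [0.9, 0.3, 0.1]

pmf = ssr_distribution(response, anchors)
expected_rating = sum((i + 1) * p for i, p in enumerate(pmf))
print([round(p, 3) for p in pmf], round(expected_rating, 2))
```

Each synthetic respondent yields a full distribution rather than a single number, so averaging many of them keeps the spread of opinion instead of collapsing it to one score.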

[06:11]

Ashley: Let's apply this directly to your world as a listener, because

this is where the strategy actually comes in. The entire success of this

semantic similarity rating depends completely on how you design those reference

statements.

Ray: Oh, absolutely. The AI is doing the talking, but the human has to set the

boundaries of the semantic map. You can't just throw this at an LLM and walk

away. GTM teams have to do the expert work of writing the exact Likert anchors

for their specific B2B niche.

Ashley: Because, I mean, a five out of five for a cybersecurity SaaS platform

sounds very different than a five for toothpaste.

Ray: That is the real engineering challenge. You're manually calibrating the

scale. You have to know the language of your buyers well enough to write the

anchors. But when you do that correctly, the results are frankly astonishing.

Ashley: Yeah, because a clever math trick is totally useless if it doesn't match

reality. So how do they test this?

Ray: PyMC Labs and Colgate put SSR to the ultimate test. They ran it against an

extensive data set of 57 real corporate surveys, encompassing 9,300 actual human

responses.

Ashley: And the success rate here is what makes this a multibillion-dollar

disruption. The synthetic AI consumers matched real human buying behavior with a

staggering 90% correlation attainment. It's massive. That is basically

test-retest reliability. If you surveyed the exact same humans twice, you'd get

about a 90% match. The AI literally hit the theoretical ceiling of how accurate

a survey can even be.

Ray: And beyond just the correlation, we have to look at the shape of the data.

They use something called the Kolmogorov-Smirnov similarity score, or KS score,

which was over 0.85.

[07:46]

Ashley: Wait, let's pause on the Kolmogorov-Smirnov score. For the sales

leaders and GTM engineers listening, what does that actually mean visually? Are

we talking about a bell curve?

Ray: It's a crucial distinction. A correlation just tells you if two things move

in the same direction. But a KS score measures the maximum distance between two

cumulative distribution functions.

Ashley: Okay, meaning what?

Ray: Visually, imagine you draw a line graph showing exactly how those 9,300

humans voted. Maybe a lot of ones. A few twos, a dip at three, and a huge spike

at five. Looks like a messy, jagged mountain range.

Ashley: Right.

Ray: The KS score proves that the AI's votes drew the exact same jagged mountain

range. The AI didn't just guess the average correctly. It perfectly mimicked the

messy, polarized spread of human opinions.
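
A small sketch of the comparison Ray is describing. It assumes the similarity score is one minus the KS distance, where the KS distance is the largest gap between the two cumulative distributions; the episode does not state the exact formula, and the three distributions below are made-up illustrations, not figures from the paper.

```python
from itertools import accumulate

def ks_similarity(pmf_a, pmf_b):
    """1 minus the largest gap between two cumulative distributions."""
    cdf_a = list(accumulate(pmf_a))
    cdf_b = list(accumulate(pmf_b))
    ks_distance = max(abs(a - b) for a, b in zip(cdf_a, cdf_b))
    return 1.0 - ks_distance

# Share of votes for ratings 1..5 (illustrative numbers only).
human = [0.30, 0.10, 0.05, 0.15, 0.40]  # lots of ones, dip at three, spike at five
bland = [0.02, 0.08, 0.80, 0.08, 0.02]  # everything clustered around three
close = [0.28, 0.12, 0.06, 0.16, 0.38]  # nearly the same jagged shape

print(round(ks_similarity(human, bland), 2))
print(round(ks_similarity(human, close), 2))
```

The bland distribution scores far lower than the close one because the score compares the whole shape, not just the average.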

[08:32]

Ashley: That is incredible. But there's a fascinating caveat in the

research, and it's vital if you're trying to implement this. The model

successfully mirrored how different demographics react, right? Like intent to

buy dropping for younger and older groups, but spiking for middle-aged cohorts.

Ray: Yes, it replicated that beautifully. But when the researchers ran the exact

same experiment and removed the demographic prompts, when they didn't tell the

AI who it was supposed to be, the correlation plummeted from 90% down to

just 50%. It basically became a coin toss. Without a specific persona

constraint, the AI rated products much more positively across the board. It just

became a generic yes-man.

Ashley: But hold on, if it drops to a 50% correlation just because you took away

the persona, doesn't that sort of invalidate the AI's actual intelligence here?

Ray: How so?

Ashley: Well, it sounds like the AI doesn't actually know if the product is

fundamentally good or bad. It's just role-playing a stereotype. If I ask it to

be a wealthy CFO, it acts stingy. If I ask it to be a millennial manager, it

loves shiny features. Is it just regurgitating stereotypes?

Ray: That's a really valid philosophical pushback. But in market research, value

is entirely subjective. There's no such thing as a fundamentally good product in

a vacuum. A product is only good in the context of the specific buyer evaluating

it.

Ashley: That makes sense.

Ray: So yes, the AI is leveraging its vast statistical knowledge of how a

wealthy CFO behaves, but that is exactly what you want it to do. It proves you

cannot just use a generic prompt. You must constrain the AI with a highly

specific identity to get a signal that maps to reality.
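
One way to picture the persona constraint is as a prompt builder that takes an optional profile. This is only a sketch: the `Persona` fields, the prompt wording, and the example product are all hypothetical, and no model is called here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Persona:
    """Hypothetical profile fields; use whatever attributes define your buyer."""
    role: str
    age: int
    budget_note: str

def build_prompt(concept: str, persona: Optional[Persona] = None) -> str:
    """Build a free-text elicitation prompt, with or without a persona."""
    lines = []
    if persona is not None:
        lines.append(
            f"You are a {persona.age}-year-old {persona.role}. "
            f"Budget situation: {persona.budget_note}."
        )
    lines.append(f"Product concept: {concept}")
    lines.append("In a few sentences, say how likely you would be to buy this and why.")
    return "\n".join(lines)

concept = "A cybersecurity SaaS platform with a one-time implementation fee."
cfo = Persona(role="CFO", age=52, budget_note="under pressure to cut spending")

with_persona = build_prompt(concept, cfo)
generic = build_prompt(concept)
print(with_persona)
```

The generic prompt is the version Ray warns about: with no identity to constrain the answer, the model has no buyer-specific reason to hold back.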

[10:04]

Ashley: Okay, got it. The context is the product, which naturally brings

up a question any data scientist listening right now is probably screaming at

their dashboard.

Ray: Let me guess. Why use an LLM at all?

Ashley: Exactly. If we have all this demographic data—age, income, role—and we

have structured product features, why not just train a traditional machine

learning algorithm on historical survey data and let it predict the outcomes?

Ray: It's the classic debate happening everywhere right now: generative AI

versus traditional machine learning. And the researchers actually tested this

head-to-head. They trained 300 LightGBM classifiers, which are gradient-boosted

decision trees, to see if traditional machine learning could beat the LLM.

Ashley: Let's define gradient-boosted decision trees for a second. This is an

algorithm that looks at columns in a spreadsheet, right? It creates logical

splits, like if the price is over $50 and the buyer is under 30, then the

probability of purchase drops by 10%.
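
To make the "logical splits" idea concrete, here is a hand-written sketch of how an ensemble of small trees scores one buyer. Nothing here is trained: the rules, thresholds, and base rate are invented for illustration, and the adjustments are added directly to a probability for readability, whereas a real gradient-boosted classifier typically learns its splits from data and sums its trees in log-odds space.

```python
def tree_price_age(buyer):
    """One shallow tree: the split Ashley describes."""
    if buyer["price_usd"] > 50 and buyer["age"] < 30:
        return -0.10
    return 0.0

def tree_income(buyer):
    """A second, hypothetical tree keyed on income."""
    if buyer["income_usd"] >= 100_000:
        return 0.05
    return 0.0

def predict_purchase_probability(buyer, base=0.40, trees=(tree_price_age, tree_income)):
    """Start from a base rate and add each tree's small correction."""
    score = base + sum(tree(buyer) for tree in trees)
    return min(1.0, max(0.0, score))

young_buyer = {"price_usd": 60, "age": 25, "income_usd": 40_000}
older_buyer = {"price_usd": 60, "age": 45, "income_usd": 120_000}

print(round(predict_purchase_probability(young_buyer), 2))
print(round(predict_purchase_probability(older_buyer), 2))
```

Every input is a column with a number or a label in it, which is the limitation the comparison with the language model turns on.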

Ray: Exactly. Highly logical, highly structured. And the comparison is striking.

Even with access to in-sample training data from the actual historical surveys,

the LightGBM models only achieved a 65% correlation. Contrast that with the SSR

method using an LLM, which achieved 88% in this specific test. But the crucial

mind-blowing detail here is that the LLM achieved 88% with zero-shot

elicitation.

Ashley: Meaning the language model saw absolutely zero previous training data

from those specific surveys. It wasn't trained on the spreadsheet of past

answers. It went entirely blind, looked at the product concept, and still

completely crushed the decision tree algorithm that literally had the answer

key.

Ray: Yes. And we have to look at why. Why does a zero-shot language model beat a

trained decision tree? It comes down to how they process unstructured

information. The traditional model only processed coarse-grained structured

data: price, tiers, labels. But the LLM processed the entire holistic concept,

the specific product descriptions, the positioning statements, the visual

elements, the nuanced feature lists.

Ashley: It leverages its vast semantic understanding of how human beings talk

about those things. A decision tree knows the price went up. An LLM knows that

the specific phrasing of the marketing copy sounds condescending or that a

feature solves an emotional pain point. It understands the vibe, not just the

spreadsheet.

Ray: Precisely. That semantic understanding is exactly what allows GTM engineers

to build truly predictive models today. You're moving beyond structured data and

analyzing sentiment and psychological friction.

[12:31]

Ashley: But the bottleneck here isn't the AI's capability, it's the

infrastructure. To actually map these semantic spaces at scale requires

specialized tools, which is exactly what our sponsors focus on. If you're

building out these kinds of workflows, you really should look at [SPONSOR]

Demand 7 at demand7.ai. That's where AI meets demand generation.

Ray: Yeah. And for the technical infrastructure side, [SPONSOR] GTM 7 at gtm7.ai

is where GTM engineering meets AI execution. They're building the exact systems

that make this kind of semantic analysis possible at an enterprise scale.

Ashley: Highly recommend looking into both of those, because applying this in

the real world, moving it from a research paper into your revenue pipeline, is

where the true value lies. We've seen how this works for consumer goods, but how

does it solve the B2B marketer's biggest nightmare?

Ray: Let's translate this into enterprise B2B strategy, because if you sell

complex software, human buyers are incredibly frustrating. Like we talked about

at the top of the show, they suffer from massive positivity bias.

Ashley: Oh, constantly. They smile through the pitch, nod at your pricing slide,

and then politely pass and never answer your emails again. The feedback loop is

totally broken because people in corporate settings generally hate delivering

harsh criticism to your face.

Ray: But synthetic consumers don't care about professional courtesy. The

research explicitly found that LLMs prompted through the SSR method are far less

prone to positivity bias. They provide a much wider dynamic range of purchase

intent. They are actually willing to give you the hard no.

[14:01]

Ashley: And in B2B, that hard no is the single most valuable piece of

data a product marketer can get. It's what the researchers call the

discriminative signal, right?

Ray: Exactly. And you can use this discriminative signal to identify incredibly

specific friction points before you ever go to market. For example, the LLMs

reacted very sensitively to income constraints. When prompted with a persona in

danger of budget issues, its purchase intent plummeted immediately. It panicked.

Ashley: Just like a real buyer would. Think about how you can weaponize that in

your go-to-market strategy. You can use this to A/B test pricing tiers and

uncover exactly where a buyer's budget friction point lies.

Ray: Without waiting for a live pilot to fail.

Ashley: Exactly. You don't have to lose a six-month deal cycle just to find out

your implementation fee was too high. You spin up 50 synthetic CFOs, run your

new pricing model, and get detailed qualitative feedback mapped to quantitative

data.

Ray: And because it's free text, translated into math, the feedback is far

deeper than what actual humans would bother to write in a text box. A human

writes, "price too high." The synthetic CFO writes a three-paragraph explanation

of why the fee disrupts their Q3 OpEx budget.
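
A sketch of how the pricing comparison could be summarized once each synthetic persona's response has been converted into a five-point distribution. The distributions below are toy placeholders, not results, and the tier names and the "hard no" definition (ratings 1 and 2 combined) are assumptions made for illustration.

```python
def average_pmf(pmfs):
    """Average several five-point distributions into one."""
    n = len(pmfs)
    return [sum(p[i] for p in pmfs) / n for i in range(5)]

def summarize(pmfs):
    """Return (mean rating, share of probability on ratings 1 and 2)."""
    avg = average_pmf(pmfs)
    mean_rating = sum((i + 1) * p for i, p in enumerate(avg))
    hard_no_share = avg[0] + avg[1]
    return mean_rating, hard_no_share

# Toy distributions for three synthetic CFOs per pricing tier (ratings 1..5).
tier_low_fee = [
    [0.05, 0.10, 0.20, 0.35, 0.30],
    [0.10, 0.15, 0.25, 0.30, 0.20],
    [0.05, 0.05, 0.20, 0.40, 0.30],
]
tier_high_fee = [
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.35, 0.25, 0.20, 0.15, 0.05],
    [0.50, 0.25, 0.15, 0.05, 0.05],
]

low_mean, low_hard_no = summarize(tier_low_fee)
high_mean, high_hard_no = summarize(tier_high_fee)
print(round(low_mean, 2), round(low_hard_no, 2))
print(round(high_mean, 2), round(high_hard_no, 2))
```

The numbers tell you which tier draws more rejections; the free-text responses behind each distribution tell you why.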

Ashley: It completely shifts the timeline of concept testing. You're moving from

lagging indicators, finding out why you lost after the fact, to leading

indicators. You get the explanation for the rejection before you've even

finalized the pitch deck. Which fundamentally changes your role.

Ray: You're iterating your messaging against an unforgiving, mathematically

precise, synthetic audience until the friction is gone.

[15:35]

Ashley: So I want to leave you with a final thought to mull over as you

look at your own pipeline. We talk a lot about intent signals in our industry,

right? Tracking who visited your pricing page.

Ray: Right. Intent signals.

Ashley: Well, intent signals tell you a prospect is looking. But intent

simulation, what we've been talking about today, tells you exactly why they're

eventually going to say no. Which one of those is actually more valuable to your

sales ops leader? That is the real question. Should every single B2B pilot or

major feature release be blessed by a synthetic critic first? To tell you the

brutal truth before you waste resources on a live launch.

Ray: It's a question that every GTM team is going to have to answer very soon

because your competitors are absolutely going to start simulating your buyers.

We'd love to hear how you're thinking about integrating synthetic feedback into

your workflows. Come continue the conversation with us over at podcast7.ai.

Ashley: Until next time, keep engineering those outcomes.
