
Beyond Intent Signals: Simulating the Brutal Truth

Published: 15 June 2026

[00:00]
Ashley: Hi, I'm Ashley.
Ray: And I am Ray.
Ashley: Welcome to Podcast 7. Today we're getting into a breakthrough that
honestly sounds like something straight out of a corporate sci-fi thriller, but
it has immediate real-world implications for how you build your go-to-market
strategy.
Ray: Oh, absolutely. So here's the hook.
Ashley: A major toothpaste company, of all things, published a research paper
that could quietly and permanently kill the entire multibillion-dollar market
research industry. I mean, it's quite the headline, right?
Ray: We're talking about a joint research paper from PyMC Labs and the
Colgate-Palmolive Company. And to really understand why a toothpaste
manufacturer's research matters to you, especially if you're a B2B marketer, a
product manager, or a GTM engineer, we have to talk about this massive,
fundamental flaw in how buyers give feedback.
Ashley: In the B2B sales world, we literally just call it the polite pass. I
mean, it is the absolute bane of any go-to-market team's existence. You've
probably lived this, right? You get a prospect on a demo, you spend an hour
walking them through your enterprise SaaS platform, and they're nodding the
whole time.
Ray: Oh, totally. They say the user interface is intuitive. They compliment your
analytics.
Ashley: Exactly. They give you a four out of five on your post-demo survey, and
then they completely ghost your sales team.
Ray: Right, because when you survey humans, they're often too polite or rushed
or honestly just too apathetic to give you the brutally honest truth to your
face. They smile, they nod, and they walk away.
Ashley: Which creates this terrible feedback loop. Companies are spending
billions globally on focus groups and surveys, but the data is just incredibly
noisy. It's totally inflated by politeness.
Ray: Exactly. So what this PyMC Labs and Colgate research explores is how we can
use AI to completely bypass that flawed human loop. They're engineering
synthetic consumers that are, well, entirely immune to being polite. They do not
care about your feelings. They will give you the unvarnished truth.
[02:05]
Ashley: But to appreciate how revolutionary this fix is, we kind of have
to understand why early attempts at using AI for market research basically
face-planted. Because researchers have been trying to use large language models
for this for a while now.
Ray: Oh, yeah. Naturally, when models like GPT-4 came along, people thought,
great. We don't need human focus groups anymore. We'll just feed the AI our
product concept and ask it to rate it on a one to five scale.
Ashley: It makes logical sense on the surface, right? You have this vastly
intelligent model. It read the whole Internet. So just ask for a number.
Ray: Yeah, it's called a direct Likert rating. But the data in this paper shows
exactly what happens when you do that. The AI produces these incredibly
unrealistic, overly narrow distributions. It completely fails to match the human
tendency to, you know, hand out strong opinions. It just plays it incredibly
safe.
Ashley: Exactly.
Ray: Instead of definitive fours or fives or brutal ones, the models heavily
cluster their answers right around a neutral three.
Ashley: I love this detail because it completely shatters the illusion of the
all-knowing, confident AI. When you directly prompt an LLM for a rating, it
doesn't act like a decisive buyer. It acts exactly like a terrified corporate
intern.
Ray: Yes.
Ashley: You know, an intern who is desperately afraid of having a real opinion,
so they just circle average on every single survey to avoid getting yelled at by
the boss.
Ray: It's actually worse than an intern, really. Because an intern is driven by
fear, whereas the AI is mathematically designed to optimize for the safest
statistical probability.
Ashley: Right, because it's just predicting the next token.
Ray: Exactly. When confronted with a direct numerical choice without context,
picking the median, a three out of five, minimizes the statistical penalty. It
regresses to the mean.
Ashley: So the average might look OK, but the actual distribution is completely
wrong.
Ray: Yeah. Real humans have messy, varied opinions. The directly prompted AI is
just mathematically bland.
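To make the "average looks OK, distribution is wrong" point concrete, here is a minimal sketch. Both distributions are made-up illustrative numbers, not figures from the paper: one is polarized the way Ray describes human answers, the other clusters on three the way a directly prompted model might.

```python
# Probability of each rating 1..5. Both lists are invented for illustration.
human = [0.30, 0.10, 0.05, 0.15, 0.40]       # polarized: many ones and fives
direct_llm = [0.00, 0.05, 0.70, 0.20, 0.05]  # clustered on a neutral three

RATINGS = range(1, 6)

def mean(pmf):
    """Expected rating under a 5-point distribution."""
    return sum(r * p for r, p in zip(RATINGS, pmf))

def variance(pmf):
    """Spread of ratings around the expected rating."""
    m = mean(pmf)
    return sum(p * (r - m) ** 2 for r, p in zip(RATINGS, pmf))

print("means:    ", mean(human), mean(direct_llm))
print("variances:", variance(human), variance(direct_llm))
```

In this toy case the two means agree while the human variance is several times larger, which is why comparing averages alone can hide a badly wrong distribution.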
[04:00]
Ashley: OK, I'm stuck here. If asking an AI for a simple number gives us
middle-of-the-road, mathematically bland garbage, how do we actually get the
truth? I mean, a paragraph of raw text from an AI role-playing as a buyer is
nice for a brainstorm. Sure. But it seems completely useless to a data
scientist. You can't put a rambling paragraph into a predictive spreadsheet. How
did the researchers fix the math?
Ray: That is the core innovation of this paper. They realized that to get the
truth, you have to let the AI do what it does best: generate language. You ask
it to talk instead of asking it to count. They call it semantic similarity rating,
or SSR. Instead of forcing the LLM to pick a number, they give the model a
specific demographic profile, show it the product, and just ask the AI to write
down its raw, unfiltered thoughts in free text.
Ashley: But to solve the scaling problem, how to turn text into data, they rely
on embeddings, right? Let's unpack embeddings for the GTM engineers listening
who might be dealing with semantic search.
Ray: Sure. They take that free text response and run it through OpenAI's
text-embedding-3-small model. An embedding model essentially translates concepts
and sentences into a coordinate in a massive multidimensional mathematical
space.
Ashley: So if we picture this as a physical space, like a giant 3D room, if the
AI writes a glowing review, it gets plotted in the top right corner. If it
writes a terrible review, it goes to the bottom left.
Ray: That's a really great way to visualize it. The review literally becomes a
specific point on a map.
Ashley: Exactly.
Ray: And once you have that point on the map, the researchers calculate
something called cosine similarity between that point and a set of predefined
anchor statements.
Ashley: And these anchor statements are like the reference points on the map
that the researchers placed there beforehand.
Ray: Yes. You place an anchor statement for a one rating, which might be the
phrase, "I would definitely not buy this." And an anchor for a five rating like,
"I am definitely buying this." By measuring the mathematical distance, the
cosine similarity between the AI's raw thoughts and those anchors, they project
the text onto a five-point scale. It turns qualitative rambling into a highly
precise numerical probability distribution.
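The anchor projection Ray describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact procedure: the three-dimensional vectors are toy stand-ins for real embeddings, the function names are hypothetical, and the normalization (subtract the lowest similarity, then divide by the sum) is one plausible choice that the episode does not spell out.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ssr_distribution(response_vec, anchor_vecs):
    """Turn similarities to the five anchors into probabilities over 1..5.

    Assumed normalization: shift so the least similar anchor gets zero,
    then rescale so the values sum to one.
    """
    sims = [cosine_similarity(response_vec, a) for a in anchor_vecs]
    lowest = min(sims)
    shifted = [s - lowest for s in sims]
    total = sum(shifted)
    if total == 0:  # response equally similar to every anchor
        return [1 / len(sims)] * len(sims)
    return [s / total for s in shifted]

# Toy stand-ins for embedded anchor statements, ratings 1 through 5.
anchors = [
    [1.0, 0.0, 0.1],  # 1: "I would definitely not buy this."
    [0.8, 0.3, 0.1],
    [0.5, 0.5, 0.1],
    [0.3, 0.8, 0.1],
    [0.0, 1.0, 0.1],  # 5: "I am definitely buying this."
]
response = [0.1, 0.9, 0.1]  # toy embedding of an enthusiastic free-text reply

pmf = ssr_distribution(response, anchors)
print([round(p, 3) for p in pmf])
```

With this normalization the least similar anchor always receives zero probability, so a production version would likely want a softer scheme. In practice the vectors would come from an embedding model such as the one named above.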
[06:11]
Ashley: Let's apply this directly to your world as a listener, because
this is where the strategy actually comes in. The entire success of this
semantic similarity rating depends completely on how you design those reference
statements.
Ray: Oh, absolutely. The AI is doing the talking, but the human has to set the
boundaries of the semantic map. You can't just throw this at an LLM and walk
away. GTM teams have to do the expert work of writing the exact Likert anchors
for their specific B2B niche.
Ashley: Because, I mean, a five out of five for a cybersecurity SaaS platform
sounds very different than a five for toothpaste.
Ray: That is the real engineering challenge. You're manually calibrating the
scale. You have to know the language of your buyers well enough to write the
anchors. But when you do that correctly, the results are frankly astonishing.
Ashley: Yeah, because a clever math trick is totally useless if it doesn't match
reality. So how do they test this?
Ray: PyMC Labs and Colgate put SSR to the ultimate test. They ran it against an
extensive data set of 57 real corporate surveys, encompassing 9,300 actual human
responses.
Ashley: And the success rate here is what makes this a multibillion-dollar
disruption. The synthetic AI consumers matched real human buying behavior with a
staggering 90% correlation attainment. It's massive. The benchmark is
test-retest reliability: survey the exact same humans twice and their answers
won't agree perfectly, and that agreement is the ceiling on how accurate any
survey can be. The AI reached about 90% of that ceiling.
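One way to read "correlation attainment", assuming it means the synthetic-to-human correlation divided by the human test-retest correlation, is sketched below. The function names and the sample numbers are hypothetical and chosen only to show the arithmetic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_attainment(synthetic_means, human_means, retest_ceiling):
    """Share of the human test-retest ceiling that the synthetic panel reaches."""
    return pearson(synthetic_means, human_means) / retest_ceiling

# Invented mean purchase-intent scores for five product concepts.
human_means = [3.1, 3.8, 2.6, 4.2, 3.4]
synthetic_means = [3.0, 3.6, 2.9, 4.0, 3.5]
retest_ceiling = 0.95  # assumed correlation of humans with themselves on a re-survey

print(correlation_attainment(synthetic_means, human_means, retest_ceiling))
```

Dividing by the ceiling keeps the score honest: a raw correlation can never be expected to beat the agreement humans show with their own earlier answers.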
Ray: And beyond just the correlation, we have to look at the shape of the data.
They use something called the Kolmogorov-Smirnov similarity score, or KS score,
which was over 0.85.
[07:46]
Ashley: Wait, let's pause on the Kolmogorov-Smirnov score. For the sales
leaders and GTM engineers listening, what does that actually mean visually? Are
we talking about a bell curve?
Ray: It's a crucial distinction. A correlation just tells you if two things move
in the same direction. But a KS score measures the maximum distance between two
cumulative distribution functions.
Ashley: Okay, meaning what?
Ray: Visually, imagine you draw a line graph showing exactly how those 9,300
humans voted. Maybe a lot of ones, a few twos, a dip at three, and a huge spike
at five. It looks like a messy, jagged mountain range.
Ashley: Right.
Ray: A KS score that high means the AI's votes drew nearly the same jagged
mountain range. The AI didn't just guess the average correctly. It closely
mimicked the messy, polarized spread of human opinions.
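A small sketch of the KS comparison on a five-point scale follows. It assumes the "similarity score" is one minus the KS distance, which is a common convention but is not confirmed in the episode, and all three distributions are invented for illustration.

```python
from itertools import accumulate

def ks_similarity(pmf_a, pmf_b):
    """One minus the largest gap between the two cumulative distributions."""
    cdf_a = list(accumulate(pmf_a))
    cdf_b = list(accumulate(pmf_b))
    distance = max(abs(a - b) for a, b in zip(cdf_a, cdf_b))
    return 1.0 - distance

# Probability of each rating 1..5 (made-up numbers).
human = [0.30, 0.10, 0.05, 0.15, 0.40]      # polarized "mountain range"
close_match = [0.28, 0.12, 0.07, 0.15, 0.38]  # synthetic panel with a similar shape
bland = [0.00, 0.05, 0.70, 0.20, 0.05]      # everything clustered on three

print("close match:", ks_similarity(human, close_match))
print("bland:      ", ks_similarity(human, bland))
```

The bland distribution has the same mean as the human one in this example, yet it scores far lower, which is the distinction between matching an average and matching a shape.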
[08:32]
Ashley: That is incredible. But there's a fascinating caveat in the
research, and it's vital if you're trying to implement this. The model
successfully mirrored how different demographics react, right? Like intent to
buy dropping for younger and older groups, but spiking for middle-aged cohorts.
Ray: Yes, it replicated that beautifully. But when the researchers ran the exact
same experiment and removed the demographic prompts, when they didn't tell the
AI who it was supposed to be, the correlation plummeted from 90% down to
just 50%. It basically became a coin toss. Without a specific persona
constraint, the AI rated products much more positively across the board. It just
became a generic yes-man.
Ashley: But hold on, if the correlation drops to 50% just because you took away
the persona, doesn't that sort of invalidate the AI's actual intelligence here?
Ray: How so?
Ashley: Well, it sounds like the AI doesn't actually know if the product is
fundamentally good or bad. It's just role-playing a stereotype. If I ask it to
be a wealthy CFO, it acts stingy. If I ask it to be a millennial manager, it
loves shiny features. Is it just regurgitating stereotypes?
Ray: That's a really valid philosophical pushback. But in market research, value
is entirely subjective. There's no such thing as a fundamentally good product in
a vacuum. A product is only good in the context of the specific buyer evaluating
it.
Ashley: That makes sense.
Ray: So yes, the AI is leveraging its vast statistical knowledge of how a
wealthy CFO behaves, but that is exactly what you want it to do. It proves you
cannot just use a generic prompt. You must constrain the AI with a highly
specific identity to get a signal that maps to reality.
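A persona-constrained prompt might be assembled along these lines. The function name, the profile fields, and the wording are all hypothetical; the paper's actual prompts are not quoted in this episode.

```python
def build_persona_prompt(persona, concept):
    """Combine a demographic profile and a product concept into one prompt.

    The template wording is an illustrative assumption, not the researchers' text.
    """
    profile = ", ".join(f"{key}: {value}" for key, value in persona.items())
    return (
        f"You are a survey respondent with this profile: {profile}.\n"
        f"Here is a product concept:\n{concept}\n"
        "In a few sentences, say how likely you would be to buy it and why."
    )

# Hypothetical persona and concept for a B2B test.
persona = {"role": "CFO", "company size": "500 employees", "age": 52}
concept = "A security analytics platform priced at $40 per seat per month."

print(build_persona_prompt(persona, concept))
```

Note that the prompt asks for free text rather than a number, so the reply can then be scored against anchor statements instead of being read as a direct rating.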
[10:04]
Ashley: Okay, got it. The context is the product, which naturally brings
up a question any data scientist listening right now is probably screaming at
their dashboard.
Ray: Let me guess. Why use an LLM at all?
Ashley: Exactly. If we have all this demographic data—age, income, role—and we
have structured product features, why not just train a traditional machine
learning algorithm on historical survey data and let it predict the outcomes?
Ray: It's the classic debate happening everywhere right now: generative AI
versus traditional machine learning. And the researchers actually tested this
head-to-head. They trained 300 LightGBM classifiers, which are gradient-boosted
decision trees, to see if traditional machine learning could beat the LLM.
Ashley: Let's define gradient-boosted decision trees for a second. This is an
algorithm that looks at columns in a spreadsheet, right? It creates logical
splits, like if the price is over $50 and the buyer is under 30, then the
probability of purchase drops by 10%.
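Ashley's example can be written out as a single hand-coded tree. This is only a sketch of the kind of split she describes: the base probability is made up, and a real gradient-boosted model such as LightGBM learns many small trees from data and combines their outputs rather than using one fixed rule.

```python
def purchase_probability(price, buyer_age, base=0.50):
    """One hand-written decision tree over two structured columns.

    base is an invented starting probability used only for illustration.
    """
    if price > 50:
        if buyer_age < 30:
            return base - 0.10  # the split from the example: pricey and young
        return base
    return base

print(purchase_probability(price=60, buyer_age=25))
print(purchase_probability(price=60, buyer_age=45))
print(purchase_probability(price=30, buyer_age=25))
```

Everything the tree can react to has to exist as a column, which is the limitation the hosts contrast with a language model reading the whole product concept.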
Ray: Exactly. Highly logical, highly structured. And the comparison is striking.
Even with access to in-sample training data from the actual historical surveys,
the LightGBM models only achieved a 65% correlation. Contrast that with the SSR
method using an LLM, which achieved 88% in this specific test. But the crucial
mind-blowing detail here is that the LLM achieved 88% with zero-shot
elicitation.
Ashley: Meaning the language model saw absolutely zero previous training data
from those specific surveys. It wasn't trained on the spreadsheet of past
answers. It went entirely blind, looked at the product concept, and still
completely crushed the decision tree algorithm that literally had the answer
key.
Ray: Yes. And we have to look at why. Why does a zero-shot language model beat a
trained decision tree? It comes down to how they process unstructured
information. The traditional model only processed coarse-grained structured
data: price, tiers, labels. But the LLM processed the entire holistic concept,
the specific product descriptions, the positioning statements, the visual
elements, the nuanced feature lists.
Ashley: It leverages its vast semantic understanding of how human beings talk
about those things. A decision tree knows the price went up. An LLM knows that
the specific phrasing of the marketing copy sounds condescending or that a
feature solves an emotional pain point. It understands the vibe, not just the
spreadsheet.
Ray: Precisely. That semantic understanding is exactly what allows GTM engineers
to build truly predictive models today. You're moving beyond structured data and
analyzing sentiment and psychological friction.
[12:31]
Ashley: But the bottleneck here isn't the AI's capability, it's the
infrastructure. To actually map these semantic spaces at scale requires
specialized tools, which is exactly what our sponsors focus on. If you're
building out these kinds of workflows, you really should look at [SPONSOR]
Demand 7 at demand7.ai. That's where AI meets demand generation.
Ray: Yeah. And for the technical infrastructure side, [SPONSOR] GTM 7 at gtm7.ai
is where GTM engineering meets AI execution. They're building the exact systems
that make this kind of semantic analysis possible at an enterprise scale.
Ashley: Highly recommend looking into both of those, because applying this in
the real world, moving it from a research paper into your revenue pipeline, is
where the true value lies. We've seen how this works for consumer goods, but how
does it solve the B2B marketer's biggest nightmare?
Ray: Let's translate this into enterprise B2B strategy, because if you sell
complex software, human buyers are incredibly frustrating. Like we talked about
at the top of the show, they suffer from massive positivity bias.
Ashley: Oh, constantly. They smile through the pitch, nod at your pricing slide,
and then politely pass and never answer your emails again. The feedback loop is
totally broken because people in corporate settings generally hate delivering
harsh criticism to your face.
Ray: But synthetic consumers don't care about professional courtesy. The
research explicitly found that LLMs prompted through the SSR method are far less
prone to positivity bias. They provide a much wider dynamic range of purchase
intent. They are actually willing to give you the hard no.
[14:01]
Ashley: And in B2B, that hard no is the single most valuable piece of
data a product marketer can get. It's what the researchers call the
discriminative signal, right?
Ray: Exactly. And you can use this discriminative signal to identify incredibly
specific friction points before you ever go to market. For example, the LLMs
reacted very sensitively to income constraints. When prompted with a persona
facing budget pressure, the model's purchase intent plummeted immediately. It panicked.
Ashley: Just like a real buyer would. Think about how you can weaponize that in
your go-to-market strategy. You can use this to A/B test pricing tiers and
uncover exactly where a buyer's budget friction point lies.
Ray: Without waiting for a live pilot to fail.
Ashley: Exactly. You don't have to lose a six-month deal cycle just to find out
your implementation fee was too high. You spin up 50 synthetic CFOs, run your
new pricing model, and get detailed qualitative feedback mapped to quantitative
data.
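The pricing comparison could be summarized along these lines, assuming each synthetic CFO's free-text reply has already been turned into a five-point distribution. The numbers are invented and use three personas instead of fifty to keep the sketch short.

```python
RATINGS = range(1, 6)

def average_pmf(pmfs):
    """Average per-persona distributions into one panel-level distribution."""
    count = len(pmfs)
    return [sum(pmf[i] for pmf in pmfs) / count for i in range(5)]

def expected_rating(pmf):
    """Mean purchase intent on the 1..5 scale."""
    return sum(r * p for r, p in zip(RATINGS, pmf))

# Made-up distributions for three synthetic CFOs under two implementation fees.
low_fee = [
    [0.05, 0.10, 0.20, 0.35, 0.30],
    [0.10, 0.15, 0.25, 0.30, 0.20],
    [0.05, 0.05, 0.20, 0.40, 0.30],
]
high_fee = [
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.30, 0.30, 0.20, 0.15, 0.05],
    [0.50, 0.25, 0.15, 0.05, 0.05],
]

for label, panel in (("low fee", low_fee), ("high fee", high_fee)):
    pooled = average_pmf(panel)
    print(label, round(expected_rating(pooled), 2), [round(p, 2) for p in pooled])
```

Keeping the pooled distribution next to the mean shows whether a weaker tier is losing a little enthusiasm everywhere or triggering outright rejections from a few personas.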
Ray: And because it's free text translated into math, the feedback is far
deeper than what actual humans would bother to write in a text box. A human
writes, "price too high." The synthetic CFO writes a three-paragraph explanation
of why the fee disrupts their Q3 OpEx budget.
Ashley: It completely shifts the timeline of concept testing. You're moving from
lagging indicators, finding out why you lost after the fact, to leading
indicators. You get the explanation for the rejection before you've even
finalized the pitch deck. Which fundamentally changes your role.
Ray: You're iterating your messaging against an unforgiving, mathematically
precise, synthetic audience until the friction is gone.
[15:35]
Ashley: So I want to leave you with a final thought to mull over as you
look at your own pipeline. We talk a lot about intent signals in our industry,
right? Tracking who visited your pricing page.
Ray: Right. Intent signals.
Ashley: Well, intent signals tell you a prospect is looking. But intent
simulation, what we've been talking about today, tells you exactly why they're
eventually going to say no. Which one of those is actually more valuable to your
sales ops leader? That is the real question. Should every single B2B pilot or
major feature release be blessed by a synthetic critic first, one that tells you
the brutal truth before you waste resources on a live launch?
Ray: It's a question that every GTM team is going to have to answer very soon
because your competitors are absolutely going to start simulating your buyers.
We'd love to hear how you're thinking about integrating synthetic feedback into
your workflows. Come continue the conversation with us over at podcast7.ai.
Ashley: Until next time, keep engineering those outcomes.