News

GEO Tooling Under the Hood: Why "Vibe Coded" Prototypes Are Just the Beginning

Ad agencies like Havas, Broadhead, and Supergood are vibe-coding their own generative engine optimization (GEO) tools on top of large language models, often in a matter of hours.

The marketing industry has a new party trick: building brand monitoring tools in an evening. Agencies are chest-thumping about GEO dashboards assembled in two hours using Claude Code and Replit, and the trade press is duly impressed. But if you're a marketing or data leader deciding whether to spin up your own generative engine optimization infrastructure, the build-vs-buy conversation starts after the prototype demo ends—and that's where most agencies aren't being fully transparent about the complexity involved.

The Adweek piece on agencies "vibe coding" GEO products is a useful signal about where the market is heading. But it conflates proof-of-concept speed with production readiness in ways that could lead teams to make expensive miscalculations. Here's what the actual infrastructure requires, and how to think about the ROI tradeoffs before you commit.

What GEO Tooling Actually Requires Under the Hood

Broadhead's Mitch Hislop built his agency's first GEO monitoring platform in a single evening. That's a genuine achievement—and also a slightly misleading benchmark. What he built was a query runner with competitive ranking logic. What a production GEO system actually requires is substantially more complex across three distinct layers.

Data pipelines. A viable GEO tool doesn't just fire prompts at one model. Havas's Brand Insights AI runs queries across multiple LLMs, covering 100 countries and 60 languages. Building the ingestion layer to handle that at scale—normalizing outputs from GPT-4o, Claude, Gemini, and Perplexity simultaneously—requires real data engineering. You need deduplication logic, output schema standardization, and storage architecture that doesn't collapse when you're running thousands of brand queries per day. This isn't a Claude Code afternoon project; it's a data infrastructure problem.
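To make the normalization problem concrete, here's a minimal sketch of what a shared output schema might look like. The field names, the dedupe scheme, and the `find_rank` heuristic are illustrative assumptions, not a description of Havas's actual pipeline:

```python
# Sketch: normalizing heterogeneous LLM outputs into one schema before storage.
# Field names and the dedupe scheme are illustrative, not any vendor's design.
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

@dataclass
class BrandMention:
    provider: str          # e.g. "openai", "anthropic", "google", "perplexity"
    model: str             # exact model string returned by the API
    query_id: str          # stable ID for the prompt that produced this output
    brand: str
    rank: int | None       # position in the response, if the brand appeared
    raw_text: str
    retrieved_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def dedupe_key(self) -> str:
        """Same provider + model + query + text => same record."""
        payload = f"{self.provider}|{self.model}|{self.query_id}|{self.raw_text}"
        return hashlib.sha256(payload.encode()).hexdigest()

def find_rank(text: str, brand: str) -> int | None:
    """Naive rank heuristic: index of the first line mentioning the brand."""
    for i, line in enumerate(text.splitlines(), start=1):
        if brand.lower() in line.lower():
            return i
    return None

def normalize_openai(query_id: str, brand: str, resp: dict) -> BrandMention:
    """Map an OpenAI chat-completion payload onto the shared schema."""
    text = resp["choices"][0]["message"]["content"]
    return BrandMention(
        provider="openai",
        model=resp["model"],
        query_id=query_id,
        brand=brand,
        rank=find_rank(text, brand),
        raw_text=text,
    )
```

The point of the shared schema is that everything downstream, from deduplication to storage to longitudinal tracking, operates on one shape regardless of provider. You'd write one `normalize_*` adapter per model API and leave the rest of the pipeline untouched.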

Prompt engineering and maintenance. The quality of a GEO signal is only as good as the prompts generating it. Havas's system generates prompts based on a client's brand profile and simulates how those brands appear in AI-driven discovery. That sounds elegant until you consider prompt drift—the reality that the same prompt produces meaningfully different outputs as underlying models are updated. Production GEO systems need prompt versioning, A/B testing frameworks, and regression testing pipelines to catch when a model update changes your brand visibility metrics in ways that aren't real signal. Without this, you're measuring noise.
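A minimal version of that discipline might look like the sketch below: pin each prompt template to a version and fingerprint, then flag metric movement between runs of the same version as a candidate measurement artifact. The `drift_check` heuristic and the 15% threshold are assumptions for illustration, not established GEO practice:

```python
# Sketch: prompt versioning plus a crude drift flag. The visibility scores
# and the 0.15 threshold are placeholder assumptions.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    template: str
    version: str  # e.g. "2025-06-v3"

    @property
    def fingerprint(self) -> str:
        """Content hash, so a silent template edit can't masquerade
        as the same prompt version."""
        return hashlib.sha256(self.template.encode()).hexdigest()[:12]

def drift_check(prev_scores: list[float], new_scores: list[float],
                threshold: float = 0.15) -> bool:
    """Flag when mean visibility moves more than `threshold` between runs
    of the SAME prompt version: a likely model-update artifact rather
    than a real change in brand visibility."""
    prev = sum(prev_scores) / len(prev_scores)
    new = sum(new_scores) / len(new_scores)
    return abs(new - prev) / max(prev, 1e-9) > threshold
```

The useful property here is attribution: if scores move while the prompt fingerprint is unchanged, the model moved, not the brand.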

LLM evaluation loops. The most technically sophisticated element described in the Adweek piece comes from Supergood: a self-evaluating model pipeline where an LLM generates a response, scores it against predefined criteria, and iterates until it hits a quality threshold—without human intervention. This is an actual agentic architecture, and it's meaningfully harder to build and maintain than a query-and-display prototype. Defining the evaluation rubric, calibrating the scoring thresholds, and preventing the system from gaming its own metrics are non-trivial ML engineering problems. When Supergood's Mike Barrett says "everybody's making software right now," he's describing a world where this kind of infrastructure becomes table stakes—not a differentiator you can build on a Tuesday evening.
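The Adweek piece doesn't disclose Supergood's implementation, but the generate-score-iterate pattern it describes generally looks like the following sketch. The callables, rubric shape, quality threshold, and iteration cap are all stand-ins:

```python
# Sketch of a generate-score-iterate loop. The 0.8 threshold, iteration
# cap, and revision prompt are invented for illustration.
def self_evaluating_pipeline(prompt: str,
                             generate,        # callable: prompt -> str
                             score,           # callable: (text, rubric) -> float
                             rubric: dict,
                             threshold: float = 0.8,
                             max_iterations: int = 5) -> tuple[str, float]:
    """Regenerate with score-informed feedback until the output clears
    the rubric threshold. The iteration cap guards against loops that
    never converge; keeping every (draft, score) pair auditable is what
    makes it harder for the system to quietly game its own metric."""
    draft = generate(prompt)
    best = (draft, score(draft, rubric))
    for _ in range(max_iterations):
        current_score = best[1]
        if current_score >= threshold:
            break
        revised = generate(
            f"{prompt}\n\nPrevious draft scored {current_score:.2f}. "
            f"Revise to better satisfy these criteria: {rubric}"
        )
        revised_score = score(revised, rubric)
        if revised_score > current_score:
            best = (revised, revised_score)
    return best
```

The hard parts the sketch glosses over are exactly the ones named above: writing a rubric the scorer can't trivially satisfy, and calibrating the threshold so the loop terminates on quality rather than exhaustion.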

The Real Build-vs-Buy Calculus

Havas deliberately passed on a multimillion-dollar Anthropic enterprise agreement, citing flexibility and uneven adoption across teams. That's a sound financial decision for a holding company managing diverse client needs. But it surfaces the core tension in the build-vs-buy equation: control costs money too; it just bills differently.

When you build in-house, you're trading licensing fees for engineering time, infrastructure costs, and ongoing maintenance overhead. For Broadhead, the math is straightforward—they don't want to pay for SEMrush and Profound when they can build something tailored to their workflow. That's rational for an agency that has technical product leadership (Hislop's VP of Product Innovation role is meaningful here) and clients who need differentiated competitive intelligence features.

But consider what the off-the-shelf GEO vendors—Profound, Bluefish, Emberos—are building toward: multi-model normalization, prompt stability across model updates, longitudinal tracking, and integrations with existing marketing stacks. Startups competing for Series A funding are staffing ML engineers to solve exactly the data pipeline and evaluation loop problems described above. The question isn't whether you can build a GEO prototype faster than you can evaluate a vendor. You almost certainly can. The question is whether your team's time is better spent building GEO infrastructure or using GEO insights to improve client outcomes.

For most brand-side marketing teams, the answer tilts toward buy—or at minimum, buy-and-extend. For agencies with product ambitions and the technical chops to maintain what they build, the calculus shifts toward build, with clear eyes about the engineering commitment required.

Actionable Takeaways for Marketing and Data Teams

Before you spin up a Claude Code session to build your own GEO tool, pressure-test the decision against these considerations:

  • Audit your actual prompt coverage. How many queries would represent a statistically meaningful sample of AI-generated discovery in your category? If it's more than a few hundred per week across multiple models, you need data infrastructure—not just a script.
  • Define your evaluation criteria before you build. What does "brand visibility improvement" mean in your AI-generated results? If you can't write a scoring rubric before you start building, you'll be measuring outputs you can't interpret. This is the step most vibe-coded prototypes skip; a hypothetical rubric sketch follows this list.
  • Run a 30-day vendor pilot before committing to an in-house build. The GEO vendor market is moving fast. Profound, in particular, is building toward multi-model normalization and longitudinal tracking that would take months to replicate. Run a structured evaluation with real client data before concluding the off-the-shelf options don't fit.
  • Separate monitoring from optimization. Most current GEO tools—built or bought—are primarily monitoring products. They tell you how a brand appears in AI responses. The harder, more valuable problem is optimization: structuring content, knowledge graphs, and entity associations to improve how LLMs represent a brand. If you're building, focus your engineering investment on the optimization layer, not just the dashboard.
  • Version your prompts from day one. Whether you build or buy, insist on prompt versioning. Model updates from Anthropic, OpenAI, and Google will affect your brand visibility metrics in ways that look like business changes but are actually measurement artifacts. You need the ability to rerun historical queries on the same prompt version.
  • Calculate your true cost of build. Take your engineering hourly rate, multiply by realistic maintenance hours (not just build hours), add infrastructure costs, and compare to annual vendor pricing. Most teams underestimate maintenance by 3-5x. A back-of-envelope version appears after this list.
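On the rubric point above, the test is whether you can write the criteria down as data before any code exists. A hypothetical example, with dimensions and weights invented purely for illustration:

```python
# Hypothetical visibility rubric, defined before any tooling is built.
# Dimensions and weights are invented for illustration.
VISIBILITY_RUBRIC = {
    "mentioned_at_all":     {"weight": 0.30, "desc": "Brand appears anywhere in the response"},
    "rank_in_top_three":    {"weight": 0.25, "desc": "Brand is among the first three named"},
    "accurate_description": {"weight": 0.25, "desc": "Claims about the brand are factually correct"},
    "recommendation_tone":  {"weight": 0.20, "desc": "Response recommends rather than merely lists"},
}

def score_response(signals: dict[str, bool]) -> float:
    """Weighted sum over boolean rubric signals; returns 0.0 to 1.0."""
    return sum(
        spec["weight"] for name, spec in VISIBILITY_RUBRIC.items()
        if signals.get(name, False)
    )
```

If you can't fill in a table like this for your category, you aren't ready to build the dashboard that would display its output.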
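And for the cost calculation, a back-of-envelope sketch. Every number below is a placeholder assumption; substitute your own fully loaded rates and actual vendor quotes:

```python
# Back-of-envelope build-vs-buy arithmetic. All figures are placeholders.
ENGINEER_RATE = 120          # fully loaded $/hour
BUILD_HOURS = 200            # initial build effort
MAINT_MULTIPLIER = 4         # maintenance often runs 3-5x build effort per year
INFRA_PER_YEAR = 12_000      # API spend, storage, monitoring
VENDOR_PER_YEAR = 30_000     # hypothetical annual license

build_year_one = ENGINEER_RATE * BUILD_HOURS * (1 + MAINT_MULTIPLIER) + INFRA_PER_YEAR
print(f"In-house, year one: ${build_year_one:,}")   # $132,000
print(f"Vendor, year one:   ${VENDOR_PER_YEAR:,}")
```

With these placeholder numbers, the in-house route runs roughly four times the vendor price in year one; the comparison only tilts toward build if the tooling itself is a product you can sell or pitch with.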

The Market Is Still Immature—Which Cuts Both Ways

The GEO space is genuinely early. Measurement standards don't exist. The vendors are pre-scale. The LLMs themselves are changing fast enough that a competitive intelligence feature built today may need fundamental rearchitecting in six months—which is precisely why Havas avoided locking into an enterprise agreement with a single provider.

That immaturity creates both opportunity and risk. Agencies that build now get proprietary tooling and the pitch differentiation that comes with it—Havas's Brand Insights AI is reportedly winning new business. But they're also committing engineering resources to a space where the underlying infrastructure (model capabilities, pricing, API stability) is still being negotiated at the provider level.

The teams that will extract the most value from GEO over the next 18 months aren't necessarily the ones who build the fastest prototypes. They're the ones who invest in the evaluation frameworks, data architectures, and prompt engineering discipline that turn a two-hour vibe-coded demo into a system that produces reliable, actionable signal at scale. Build for the demo if you need to win the room. But budget for the infrastructure if you intend to win the market.