Episode 3: Lab Wars are heating up
December 11, 2025
Welcome to The Silicon Diet — your digest on the latest happenings in AI, fundraises in the Bay Area, and insights into a variety of AI tools.
In Episode 3, Abhirup and Adi are back after a couple of weeks away — and the AI world did what it always does: ship an overwhelming number of updates in a tiny amount of time. The core theme of this episode is the frontier-lab “arms race” in coding + tool use, and what that means for builders trying to actually get repeatable work done with agents.
Then, the conversation pivots from models to real-world distribution: Adi’s physical product is unexpectedly popping off on Etsy, and that becomes a springboard into chat-based shopping assistants, trust, and why brands may need to start thinking about “SEO 2.0” (LLM / generative engine optimization).
About Your Hosts
- Abhirup — Co-founder and Head of Innovation at Sainapse (AI customer support). Based in SF.
- Adi — Builder and product person shipping new projects (including ModelMe). Based in SF.
Lab Wars are heating up: coding models, tool search, and the future of shopping
TL;DR
- The frontier labs all shipped new heavy-hitters (Gemini 3, Claude Opus 4.5, GPT‑5.1), and the battlefield is coding + tool use.
- “Long-running” agentic work is the new bar: not just writing code, but refactoring, testing, iterating, and staying coherent for hours (or longer).
- Tooling is becoming the differentiator: MCPs and tool catalogs are exploding, and context gets eaten alive if you don’t manage tools properly.
- Ilya Sutskever’s big thesis: we may be shifting from the age of scaling to an age of research, where new ideas matter as much as raw compute.
- Commerce is getting “chat-first”: ChatGPT and Perplexity are pushing shopping assistants, which changes how brands get discovered — and increases the incentive to “optimize for LLMs.”
- Side quest: Etsy’s discovery engine is insanely effective if you hit the right niche, and it raises the question: what happens when shopping moves from scrolling → chatting?
Deep dive #1 — The coding-model arms race is back (and it’s getting spicy)
A few months ago, it felt like the headline was always “bigger context” or “better reasoning.” Right now, the vibe is different: everyone is shipping models that are explicitly optimized for building things.
What Abhirup and Adi keep coming back to is a simple point:
A model isn’t “good at coding” if it can write a snippet.
It’s good if it can ship software — reliably.
What changed?
1) Tool use is finally becoming “less janky.”
The hosts call out an old failure mode that every builder has seen: the model runs tests, tests fail… and it confidently says “great, everything passed” and continues anyway. The big promise of the latest wave is not just higher benchmark scores, but repeatability and sanity in agent loops.
2) Longer horizon work is being productized.
Instead of “write a function,” the new focus is: “migrate a codebase,” “refactor a subsystem,” “run a multi-hour agent loop,” “keep going without losing the plot.”
The releases everyone’s reacting to
- Gemini 3 (Google) — positioned as a major step up in reasoning + coding + tool use.
- Claude Opus 4.5 (Anthropic) — described as a leap for “heavy-duty agentic workflows” and high-quality code.
- GPT‑5.1 (OpenAI) + GPT‑5.1‑Codex‑Max — the “Codex Max” framing is explicitly about sustained, long-running coding tasks via techniques like session compaction.
The meta-take: it’s starting to feel like every frontier lab has a “break glass in case competitor ships” model waiting in the wings.
Deep dive #2 — Tool overload is the next bottleneck (MCPs, context, and “tool search”)
Once you start building real agents, you hit a different wall than “model intelligence”:
tools don’t scale cleanly.
If you stuff 500–1,000 tools into an agent, you can end up wasting a huge chunk of context just on tool definitions — before the model even thinks.
The problem: “tool catalogs eat context”
- More tools can mean more power…
- …but they can also mean worse performance, because the model is drowning in definitions.
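To get a feel for the scale of the problem, here's a rough back-of-envelope calculation. The per-tool token figure is an illustrative assumption (JSON-schema tool definitions vary a lot in size), not a measured number:

```python
# Rough, illustrative estimate of context consumed by tool definitions.
# ~350 tokens/tool is an assumed average for a JSON-schema tool spec.
TOKENS_PER_TOOL = 350
CONTEXT_WINDOW = 200_000  # e.g. a 200k-token context window

for n_tools in (50, 500, 1000):
    overhead = n_tools * TOKENS_PER_TOOL
    print(f"{n_tools:>5} tools -> {overhead:>7,} tokens "
          f"({overhead / CONTEXT_WINDOW:.0%} of a 200k window)")
```

Even at these assumed sizes, 500 tools would consume most of a 200k window, and 1,000 tools wouldn't fit at all, before the agent has read a single line of your codebase.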
The emerging solution: “don’t load tools — search tools”
This is where the episode’s MCP conversation gets practical:
- Instead of loading every tool into the prompt, you maintain a tool registry / catalog.
- The model searches the registry to find the small subset it needs.
- Only those tool definitions get loaded into the working context.
This idea shows up in “tool search” patterns and in newer MCP-native infrastructure, and it’s one of the more important shifts in “agent architecture” right now.
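Here's a minimal sketch of the registry pattern. The tool names are hypothetical, and the keyword-overlap ranking is a stand-in: real implementations (like Anthropic's Tool Search Tool) use the model itself or embeddings to pick tools, but the shape is the same — search first, then load only the winners into context:

```python
# "Search tools, don't load tools": a toy tool registry.
# Tool names/descriptions are hypothetical examples, and keyword
# overlap stands in for a real (embedding- or model-based) ranker.
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str
    schema: dict = field(default_factory=dict)  # JSON schema for parameters

REGISTRY = [
    Tool("create_invoice", "Create a billing invoice for a customer"),
    Tool("refund_payment", "Refund a customer payment"),
    Tool("search_docs", "Search internal documentation"),
    # ...hundreds more tools live here, NOT in the prompt.
]

def search_tools(query: str, k: int = 3) -> list[Tool]:
    """Rank registry tools by naive keyword overlap with the query."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(f"{t.name} {t.description}".lower().split())), t)
        for t in REGISTRY
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:k] if score > 0]

# Only the matching definitions get loaded into the agent's context:
relevant = search_tools("refund the customer's payment")
```

The payoff is that context cost now scales with the task ("how many tools does this step need?") instead of with the catalog ("how many tools exist?").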
Open-source spot of the week: kit (context engineering for devtools)
The hosts shout out an open-source approach to the “LLM devtools harness” layer: kit, which focuses on codebase mapping, symbol extraction, code search, and the practical scaffolding you need to build coding agents and dev workflows.
This is a strong signal of where things are going: the biggest differentiation isn’t always the model — it’s the harness around it.
Deep dive #3 — Ilya’s bet: we’re entering the “age of research”
A big chunk of Episode 3 is a reaction to Ilya Sutskever’s recent podcast conversation — and the contrast between two vibes in AI:
- Big labs: “Scale compute, ship product, keep going.”
- Ilya / SSI: “Research is the bottleneck; product can wait.”
The core question: what is AGI?
One of the most interesting threads here is definitional.
- Is AGI “knowing everything”?
- Or is AGI “being able to learn anything”?
The hosts interpret Ilya’s framing as leaning toward learning and generalization — not just absorbing the internet.
Value functions, RLHF, and “human-like” decision-making
Ilya’s critique (as discussed on the show) lands on a subtle but important point:
Most training loops reward “the right answer” in a very blunt way.
But humans don’t operate on a single reward channel. We have:
- conflicting values,
- emotions,
- long-term goals,
- weird biases,
- social context.
The episode explores the idea that a path forward might involve models that aren’t just optimized to be “correct,” but to be value-driven in a more human way — which could affect reasoning, planning, and learning speed.
The pushback: research-only is brave… but risky
Adi raises the “two sides of the coin” point:
- If you build a product, you get user feedback loops and real-world data.
- If you don’t build a product, you can focus purely on research… but you might miss the insights that only come from distribution.
And of course, it sparks the obvious Silicon Valley question:
How do you justify massive valuations with no product?
Deep dive #4 — From Etsy discovery to chat-first shopping (trust is the whole game)
This is where Episode 3 gets extremely real-world.
Adi shares that he listed his physical product on Etsy expecting to manually funnel customers — and instead saw surprising organic traction.
That turns into a broader conversation:
Why Etsy feels “cracked”
When people are on Etsy, their intent is already high: they’re there to buy. So the platform’s job is matching buyers to the right niche items fast.
The takeaway: distribution still matters, and different platforms have different “intent shapes.”
But shopping is shifting
The hosts then connect the dots to shopping assistants:
- ChatGPT shopping research
- Perplexity shopping
- (and the broader “chat becomes the interface” trend)
The trust problem (and the “is this an ad?” moment)
If an assistant recommends a product, users immediately ask:
- Is this genuinely the best option for me?
- Or is it sponsored / biased / optimized for revenue?
The episode references the awkward early moments of “assistant recommendations” that feel like ads — and how sensitive users are to that line.
The brand implication: “SEO 2.0” (LLM optimization / GEO)
If discovery moves from scrolling catalogs to asking assistants, brands have to get good at being understood by LLMs.
That means:
- clear positioning,
- structured content,
- consistent language,
- and content that models can confidently map to the right persona.
In other words: LLM engine optimization is becoming a thing.
Builder’s corner — a few experiments inspired by the episode
If you’re building in this space, here are a few low-lift things to try this week:
- Track "LLM referrals." Add UTM parameters and watch for traffic coming from ChatGPT / Perplexity-style sources.
- If you have >50 tools, stop loading tools and start indexing them. Build a tool registry and add a "search → load minimal tools" layer.
- Make your product describable in one sentence. If you can't describe your product clearly, an LLM won't route users to it cleanly either.
- Prototype the "AI makes my merch" loop. The episode's fun idea: ask an LLM to design a t-shirt tailored to your identity/vibe, then order it via print-on-demand. This is the kind of "agentic commerce" workflow that's going to feel normal sooner than we think.
Wrap
Adi’s heading to a hackathon and is thinking about building a robot management system for factory floors — tracking robot location, battery, efficiency, and whether it needs help.
And the broader takeaway from Episode 3 is simple:
Models are improving fast, but the real leverage is in harnesses, tool infrastructure, and distribution.
That’s where the next wave of “real products” will come from.
See you in Episode 4.
Links & references (everything we talked about)
Frontier model releases
- Gemini 3 announcement (Google): https://blog.google/products-and-platforms/products/gemini/gemini-3/
- Claude Opus 4.5 (Anthropic): https://www.anthropic.com/news/claude-opus-4-5
- GPT‑5.1 (OpenAI): https://openai.com/index/gpt-5-1/
- GPT‑5.1‑Codex‑Max (OpenAI): https://openai.com/index/gpt-5-1-codex-max/
Tool use, MCPs, and scaling agent infrastructure
- Tool Search Tool (Anthropic docs): https://console.anthropic.com/docs/en/agents-and-tools/tool-use/tool-search-tool
- Advanced tool use on Anthropic’s dev platform: https://www.anthropic.com/engineering/advanced-tool-use
- Anthropic on code execution with MCP: https://www.anthropic.com/engineering/code-execution-with-mcp
- MCP milestone update (Anthropic): https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation
- kit (open-source context-engineering toolkit): https://github.com/cased/kit
Ilya / SSI discussion
- Dwarkesh podcast + transcript (Ilya Sutskever): https://www.dwarkesh.com/p/ilya-sutskever-2
Shopping assistants + discovery
- Shopping research in ChatGPT (OpenAI): https://openai.com/index/chatgpt-shopping-research/
- ChatGPT release notes (shopping research): https://help.openai.com/en/articles/6825453-chatgpt-release-notes
- Perplexity shopping assistant coverage (Engadget): https://www.engadget.com/ai/perplexity-announces-its-own-take-on-an-ai-shopping-assistant-210500961.html
- Etsy search basics (Etsy Seller Handbook): https://www.etsy.com/seller-handbook/article/375461474487
- Peloton “ad-looking” suggestion coverage (TechCrunch): https://techcrunch.com/2025/12/02/openai-slammed-for-app-suggestions-that-looked-like-ads/
“SEO 2.0” / GEO / optimizing for LLMs
- Vercel on adapting SEO for LLMs: https://vercel.com/blog/how-were-adapting-seo-for-llms-and-ai-search
- Google Search guidance on AI-generated content: https://developers.google.com/search/docs/fundamentals/using-gen-ai-content
Mentioned product
- ModelMe: https://www.usemodelme.com/
