LaunchProd is an AI platform founded out of Carnegie Mellon University that helps creators launch branded product lines instead of living off sponsorships. The core experience is simple: a creator uploads source material (their bio, brand voice, product sketches) and the system generates a full launch pack: landing page, email sequence, product description, and press copy. The engine underneath is a retrieval-augmented system we built end to end, iterated in production, and eventually simplified. Here is what it looked like, what worked, and three things we would do differently.
The v1 architecture
A creator profile plus brand guidelines landed in a document ingestion pipeline. We chunked at three levels: sentence, section, document. Each chunk was embedded with OpenAI's embedding model and stored in pgvector on Supabase. At query time we retrieved the top-k chunks at each of the three levels, reranked the pooled candidates with a cross-encoder, and fed the reranked context into the generation prompt.
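In code, the v1 flow looked roughly like this. This is a simplified sketch, not our production API: the embedding call and the cross-encoder are both stubbed with plain cosine similarity, and `Chunk`, `retrieve`, and `rerank` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    level: str        # "sentence" | "section" | "document"
    vec: list[float]  # embedding (stubbed here; in production, OpenAI + pgvector)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    """v1 retrieval: take the top-k chunks at EACH level, then pool them."""
    pooled = []
    for level in ("sentence", "section", "document"):
        tier = [c for c in chunks if c.level == level]
        tier.sort(key=lambda c: cosine(query_vec, c.vec), reverse=True)
        pooled.extend(tier[:k])
    return pooled

def rerank(query_vec: list[float], candidates: list[Chunk], cross_score=None) -> list[Chunk]:
    """Stand-in for the cross-encoder: any (query, chunk) -> score function works."""
    score = cross_score or (lambda q, c: cosine(q, c.vec))
    return sorted(candidates, key=lambda c: score(query_vec, c), reverse=True)
```

Note the shape of the problem already visible here: three tiers of overlapping text all flow into one rerank pool, so the cross-encoder ends up scoring much of the same material more than once.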
This worked. Output quality on the first generation was better than prompting alone by a meaningful margin, measured against a 40-sample evaluation set. It shipped in production and held.
Where it started hurting
Around month three, the pipeline started accumulating symptoms. Latency grew because the rerank step was doing work twice on overlapping chunks. Retrieval quality plateaued because chunk boundaries occasionally split the actually-useful context. And the generation prompt, which we kept extending to compensate for retrieval gaps, started bumping into context window limits on long creator profiles.
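One mitigation we could have applied, before tearing out multi-tier retrieval entirely, was deduplicating overlapping chunks ahead of the rerank, so the cross-encoder scores each span of text once. A containment-based sketch (illustrative, not our production code; real chunk overlap is fuzzier than exact substring containment):

```python
def dedupe_overlapping(texts: list[str]) -> list[str]:
    """Keep longest chunks first; drop any chunk fully contained in a kept one.

    Cuts duplicate cross-encoder work when sentence-level chunks are
    substrings of their parent section- and document-level chunks.
    """
    kept: list[str] = []
    for t in sorted(texts, key=len, reverse=True):
        if not any(t in k for k in kept):
            kept.append(t)
    return kept
```

Single-level chunking in v2 made this moot, which is the point of the next section: the fix for self-inflicted complexity is often removal, not another layer.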
The temptation at that point is to reach for something bigger. A larger model. A fine-tune on creator data. Agentic retrieval.
We did the opposite. We simplified.
The rule: before you add complexity to a retrieval system, check whether your existing complexity is fighting itself.
The v2 architecture, simplified
Single-level chunking, at the section level only. No multi-tier retrieval. No rerank. A better embedding model (we switched providers). A shorter, stricter prompt that named the retrieved context explicitly and refused to generate if retrieval quality fell below a threshold.
Latency dropped by roughly 35%. Output quality, against the same evaluation set, was unchanged on most samples and better on the worst tail. Infrastructure complexity dropped more than anything else: half the code, half the monitoring, half the on-call surface.
Three things we'd do differently
One: evaluation from day zero, not day ninety. We built the evaluation harness in month three, after we needed it. Had we built it first, the v1-to-v2 simplification would have taken a week instead of a month.
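A day-zero harness does not need to be fancy. Something like the sketch below, a list of samples, a per-sample check, and a pass rate, would have made the v1-to-v2 comparison a one-command job. Names are illustrative; `generate` is whatever callable wraps your pipeline.

```python
def evaluate(generate, samples: list[dict]) -> dict:
    """Minimal evaluation harness.

    Each sample is {"input": ..., "check": callable(output) -> bool}.
    Returns the pass rate plus the failing cases for inspection, so
    regressions on the worst tail are visible, not just the average.
    """
    failures = []
    for s in samples:
        output = generate(s["input"])
        if not s["check"](output):
            failures.append({"input": s["input"], "output": output})
    return {
        "pass_rate": 1 - len(failures) / len(samples),
        "failures": failures,
    }
```

Run it against a frozen sample set (ours was 40 samples) before and after every architecture change, and "did v2 regress?" stops being a matter of opinion.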
Two: one chunking strategy, hypothesis-first. We shipped v1 with three chunking levels because we couldn't decide which would work best. The right move was to pick one, measure it, and change it based on evidence. Shipping every strategy simultaneously hid which one was actually earning its keep.
Three: refuse-to-answer is a feature, not a fallback. The biggest single quality win in v2 was allowing the system to say "I don't have enough context to generate this." Users trusted the product more, not less, when it refused. We should have shipped that in week one.
Heuristics
- Evaluation harness before feature. Always. A retrieval system without evaluation is a pile of opinions.
- Simplify before you scale. Most retrieval complexity compounds against itself after six months of drift.
- Refuse-to-answer is a quality signal, not a failure mode. Users can tell when a product is bluffing.
The broader question, whether LaunchProd should have used RAG at all rather than prompt-only or fine-tuning, has a short answer: for LaunchProd, RAG was correct; for most of the AI engagements we've turned down since, it would have been overkill.
Written 2025-10-03 by Naman Barkiya.