2026-07-01 Naman Barkiya

4 questions before adding AI to your product.

A product needs AI if the task involves language or unstructured data, cannot be solved by a lookup or filter, has an evaluation method to confirm correct output, and justifies the inference cost. Features that fail two or more of those four conditions are better served by a simpler approach.

Most founders who add AI don't ask whether it answers a question worth asking. Four conditions. If a feature fails two, it's not an AI problem.

A product needs AI if the task involves language or unstructured data, cannot be solved by a lookup or filter, has a way to confirm whether the output is correct, and justifies the inference cost at your expected volume. Those four conditions, in that order.

Here is the test we run before we add any AI feature to a client's product. If a feature fails two or more of those conditions, we recommend building the simpler version first: a feature that doesn't need AI but ships with it will degrade faster and cost more to maintain than the same feature built without it. Most founders who add AI skip this test. The result is an AI feature that costs more than it earns and quietly worsens over time.

The four conditions

1. The task involves language, images, or unstructured data.

AI earns its cost when the input is something a database query can't parse — free-form text, voice, images, PDFs, support tickets, user-generated content. A query can decide membership tiers. A language model can decide whether a support ticket is frustrated or routine. Those are structurally different problems. Know which kind you have before you pick the tool.

If the input is structured — IDs, dates, categories, numeric fields — start with SQL and a rules engine. The AI case only opens when the input resists structure.

2. A lookup table or rule-set won't answer the same question.

Ask out loud: could this be a well-written SQL query plus a rules engine? If the answer is yes — even uncomfortably — that is the right tool. SQL doesn't hallucinate. A rule doesn't drift. It's also cheaper to run, easier to test, and faster to debug.

Most "AI" feature requests turn out to be filtering problems, not inference problems. The framing fools people because the desired output feels intelligent. Spend twenty minutes writing the rule-set before you write the prompt. If the rule-set answers it, ship the rule-set.
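As a rough sketch of what "SQL plus a rule-set" can look like, here is the membership-tier case from condition 1 written without any model. The `orders` table, the spend thresholds, and the tier names are all hypothetical, chosen only to make the example run.

```python
import sqlite3

# Hypothetical thresholds: (minimum total spend, tier), checked top to bottom.
TIER_RULES = [
    (1000.0, "gold"),
    (250.0, "silver"),
    (0.0, "basic"),
]


def tier_for(total_spend: float) -> str:
    """Return the first tier whose minimum spend the customer meets."""
    for minimum, tier in TIER_RULES:
        if total_spend >= minimum:
            return tier
    return "basic"  # fallback for negative totals such as refunds


def tiers_by_customer(conn: sqlite3.Connection) -> dict:
    """Aggregate spend per customer in SQL, then apply the rules in Python."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()
    return {customer_id: tier_for(total) for customer_id, total in rows}


def demo() -> dict:
    """Build a tiny in-memory table (made-up data) and classify it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("a", 900.0), ("a", 200.0), ("b", 300.0), ("c", 20.0)],
    )
    return tiers_by_customer(conn)
```

Every branch here can be unit-tested, and the answer for a given input never changes unless someone edits the rule table.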

3. You can evaluate whether the output is correct.

This is the condition that gets skipped most often, and the one that makes everything else unpredictable.

Without an evaluation harness — a test set, a human rater, an automated correctness check — you cannot tell if a model change made the feature better or worse. You are shipping changes into production and hoping. At LaunchProd, we built the evaluation harness in month three instead of week one. We couldn't tell whether our retrieval changes were helping until something broke in a way a user noticed. That was the most expensive mistake in the build.

The rule: if you cannot write down what a correct output looks like, you are not ready to ship the feature. Define "correct" first. Build the check second. Ship the AI third.
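A minimal sketch of "define correct first, build the check second", using the frustrated-versus-routine ticket example from condition 1. The test cases and the `keyword_baseline` stand-in are invented for illustration; this is not the harness we built at LaunchProd, and in practice `predict` would wrap your model call.

```python
from typing import Callable, Iterable, Tuple


def evaluate(
    predict: Callable[[str], str],
    test_set: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of test cases where predict() matches the label."""
    cases = list(test_set)
    if not cases:
        raise ValueError("test set is empty")
    correct = sum(1 for text, expected in cases if predict(text) == expected)
    return correct / len(cases)


# Hypothetical labeled examples: this list is the written-down definition
# of a correct output for the feature.
TEST_SET = [
    ("Where can I download my invoice?", "routine"),
    ("This is the third time it crashed!", "frustrated"),
    ("I want a refund.", "frustrated"),
    ("Still waiting after two weeks.", "frustrated"),
]


def keyword_baseline(ticket: str) -> str:
    """Stand-in classifier (an assumption) so the sketch runs without a model."""
    if "refund" in ticket.lower() or "!" in ticket:
        return "frustrated"
    return "routine"
```

Calling `evaluate(keyword_baseline, TEST_SET)` gives a single number to compare before and after any prompt, model, or retrieval change. The same harness scores the rule-based baseline and the AI version, which also tells you whether the AI is beating the simpler approach.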

4. The inference cost fits the unit economics.

Do the math early, before you commit the architecture. Model cost per call multiplied by expected volume.

At $0.001 per call and 100,000 monthly sessions (one call each): $100. At $0.05 per call at the same volume: $5,000. Neither number is inherently wrong. Both are numbers you should know before you write the first prompt. At modest scale (50,000 monthly users, one call per session, which works out to about 50,000 calls a month at one session per user) inference typically runs $25–$250 per month depending on the model and prompt length. The hidden cost is the evaluation infrastructure: two to four engineering weeks to build the harness that tells you whether the output is correct. Skipping that is how AI features quietly degrade after launch.
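The arithmetic above fits in a few lines. This sketch uses `Decimal` so the dollar figures come out exact; the function name is ours, and the sample rates are simply the ones quoted in the text.

```python
from decimal import Decimal


def monthly_inference_cost(cost_per_call: str, calls_per_month: int) -> Decimal:
    """Cost per call (as a decimal string, in dollars) times monthly call volume."""
    return Decimal(cost_per_call) * calls_per_month


# The two scenarios from the text, both at 100,000 calls per month.
cheap_model = monthly_inference_cost("0.001", 100_000)   # 100 dollars
pricey_model = monthly_inference_cost("0.05", 100_000)   # 5000 dollars
```

Swap in your own per-call price and volume before committing to an architecture. If a session makes several calls, multiply the volume accordingly.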

What the failures look like

The most common failure is a matching algorithm rebranded as AI. A playlist ranked by user history is a weighted query. A "personalized" feed sorted by recency and engagement is a sort order. These are not AI features — but they are shipped as AI features on a dozen products we've reviewed.

The second-most-common failure is an LLM call where a keyword filter would have been cheaper and more predictable. A content moderation layer that flags profanity could be a wordlist and a regex. GPT was chosen because it felt more powerful. The inference cost at scale turned out to be $2,000 per month for a feature the regex would have handled at no inference cost.
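For reference, the wordlist-and-regex version of that moderation layer is roughly this. The blocked words are mild placeholders, and a real list would be longer and maintained over time; treat it as a sketch, not a complete moderation system.

```python
import re

# Placeholder wordlist (an assumption); replace with your own list.
BLOCKED_WORDS = ["darn", "heck"]

# Whole-word, case-insensitive match against any entry in the list.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def flag(text: str) -> bool:
    """Return True if the text contains any blocked word as a whole word."""
    return PATTERN.search(text) is not None
```

The word boundaries keep `flag("The heckler left")` from firing on "heck". Misspellings and obfuscations still slip through, which is the point at which a model might start to earn its cost.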

The third failure is a GPT wrapper that required no evaluation because the output never mattered. The AI was decorative. Users clicked past it. It added latency and cost to a feature nobody cared about.

The counter-example: LaunchProd used RAG for private document retrieval — creators' own content libraries, product catalogs, brand guides. A raw LLM call would have hallucinated citations from training data. The RAG layer grounded the responses in documents the user had actually uploaded. That is a correct use of the technology — the task required retrieval, the failure mode of hallucination was real, and we could evaluate whether the output was grounded.

The contrast: four of forty-two features on a gaming platform we shipped used no AI at all, and the users didn't notice. The features that didn't need AI were built without it. The ones that did — challenge verification from video clips, language-agnostic chat moderation — earned it.

The wrapper trap

A raw API call wrapped in a UI is a demo, not a product. The demo takes a weekend. The product takes months — not because the AI is hard, but because durability comes from the surrounding system.

A wrapper becomes a product when three things are in place: the output is evaluated (you know when it's wrong), there is a feedback loop (wrong outputs improve the system over time), and retrieval is added when the answers need to be grounded in private or frequently-updated data rather than training-set knowledge.

We covered the decision between these approaches in detail in "When to use RAG, when to fine-tune, and when to just prompt." The short version: most products need prompting first, RAG when retrieval is required, and fine-tuning almost never. A wrapper that skips the evaluation and feedback steps isn't durable: a competitor ships the same wrapper in a weekend, and you have no structural advantage.

The wrapper trap is also a cost trap. Without evaluation, you cannot tell that the model is degrading. Provider updates, prompt drift, and edge cases accumulate silently. The feature ships well and gets worse quietly until a user notices.
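One way to catch silent degradation is to re-run a fixed evaluation set on a schedule and compare the pass rate with a stored baseline. The sketch below assumes each run produces a list of pass/fail results. The 2-point tolerance is an arbitrary placeholder, not a recommendation.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of evaluation cases that passed."""
    if not results:
        raise ValueError("no evaluation results")
    return sum(results) / len(results)


def has_regressed(
    baseline: Sequence[bool],
    current: Sequence[bool],
    tolerance: float = 0.02,  # hypothetical threshold; tune for your feature
) -> bool:
    """True if the current pass rate fell below the baseline by more than tolerance."""
    return pass_rate(current) < pass_rate(baseline) - tolerance
```

Run on a schedule, for example after each provider model update, a check like this turns "gets worse quietly until a user notices" into an alert you see first.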

So: does your product need AI?

Run the four conditions. If a feature fails two or more of them, build the simpler version first — ship it, measure whether it actually fails users, then revisit AI with an evaluation harness already in place. The rule: if you cannot write down what a correct output looks like, you are not ready to ship the feature.
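If it helps to make the test explicit in a planning doc or review checklist, the four conditions reduce to four booleans and a count. The field names are ours, and the wording for features that fail fewer than two conditions is an assumption, since the rule above only covers the two-or-more case.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class FeatureCheck:
    unstructured_input: bool   # 1. language, images, or unstructured data
    rules_insufficient: bool   # 2. a lookup table or rule-set won't answer it
    can_evaluate: bool         # 3. you can write down what "correct" means
    cost_fits: bool            # 4. inference cost fits the unit economics


def failed_conditions(check: FeatureCheck) -> int:
    """Count how many of the four conditions the feature fails."""
    return sum(1 for passed in astuple(check) if not passed)


def recommendation(check: FeatureCheck) -> str:
    """Apply the two-or-more rule; the other branch is our assumption."""
    if failed_conditions(check) >= 2:
        return "build the simpler version first"
    return "candidate for AI"
```

For example, the profanity filter from earlier would look like `FeatureCheck(True, False, False, True)`: text input, but a rule-set handles it and nobody defined a correct output, so the recommendation is the simpler version.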

Written 2026-07-01 by Naman Barkiya.

FAQ

Questions this usually surfaces.

Is adding a GPT wrapper to my app enough to call it an AI product?

If the wrapper is the whole product (here is your prompt, here is the response), it is a demo, not a product. A wrapper becomes durable when it sits inside a workflow where the output is evaluated, acted on, and improved over time. Without those three properties, a competitor ships the same wrapper in a weekend.

How much does it cost to add AI to an MVP?

Inference runs $0.50–$5 per thousand calls depending on the model and prompt length. At modest scale (50,000 monthly users, one call per session), that is $25–$250 per month, which is often fine. The hidden cost is evaluation infrastructure: you need a way to know whether the AI output is correct before it reaches users. That is typically two to four engineering weeks, and skipping it is how AI products quietly degrade after launch.

When should a product use RAG instead of a raw LLM call?

When answers need to be grounded in private or frequently-updated data (a knowledge base, a product catalog, a document library) and a hallucinated answer would damage trust or be functionally wrong. RAG adds two to four weeks of engineering and a vector database; it is not the right default for every AI feature, only for retrieval-dependent ones.