From RAG to Production: Lessons from Shipping AI Systems

A notebook that answers your ten test questions is maybe twenty percent of a production RAG system. I learned the other eighty percent the slow way, by building and operating these systems end to end, solo: the knowledge-base retrieval behind my own platform, which ingests content across several Sanity projects; a property-document Q&A system I built; and a real-estate CRM with semantic search on Postgres. The retrieval call is the easy part. Everything around it is the work.

The prototype-to-production gap

The demo works because you wrote the questions and you chunked the documents you tested against. Production breaks because real users ask things your chunks can't answer, your corpus changes underneath you, and the model you embedded with last quarter is no longer the model you want.

What actually fills the gap: ingestion you can re-run, embeddings you can migrate, retrieval you can filter and inspect, and evaluation that tells you when quality drops before a user does. None of that shows up in a notebook.

Chunking is a product decision, not a default

The single biggest lever on retrieval quality is how you split documents, and the default "1000 characters with 200 overlap" is almost always wrong for real content.

Property documents broke this immediately. A fixed character window cuts a table mid-row, splits a clause from its heading, and strips the one piece of context that made the chunk answerable. When the chunk says "the deposit is due within 14 days" but the section heading "Termination" landed in the previous chunk, retrieval returns a true sentence that answers the wrong question.

Clinical text from a voice-to-chart system broke it in the other direction. Transcribed Hebrew visit audio has no clean paragraph structure, so character-based splitting scatters a single finding across chunks and you retrieve half a symptom.

How I chunk now:

  • Split on document structure first (headings, sections, list boundaries), then size-cap, rather than splitting on character count and hoping structure survives.
  • Carry parent context into the child chunk. Prepend the section heading or document title to the chunk text that gets embedded so a clause is never orphaned from what it belongs to.
  • Store the raw span and the embedded span separately. The text you embed (heading + body) is not always the text you show the user.
  • Tune chunk size to the question shape. Q&A over dense legal or property text wants smaller, heading-anchored chunks. Narrative or clinical summaries want larger ones that keep a finding intact.
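
The first three points can be sketched in a few lines. This is a minimal illustration, not the code from any of the systems above. It assumes Markdown-style `#` headings mark the document structure, and the names `Chunk`, `chunk_document`, and `max_chars` are hypothetical.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^#{1,6}\s+(.*)$")  # assumption: Markdown-style headings


@dataclass
class Chunk:
    heading: str     # parent context: the section this chunk belongs to
    raw_text: str    # the span you show the user
    embed_text: str  # the span you embed (title + heading + body)


def _pack(title: str, heading: str, text: str, max_chars: int) -> list[Chunk]:
    """Size-cap one section by packing whole paragraphs, never cutting inside one.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    spans: list[str] = []
    current = ""
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            spans.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        spans.append(current)
    prefix = " > ".join(part for part in (title, heading) if part)
    return [Chunk(heading, raw, f"{prefix}\n\n{raw}" if prefix else raw) for raw in spans]


def chunk_document(doc: str, title: str, max_chars: int = 800) -> list[Chunk]:
    """Split on headings first, then size-cap each section."""
    chunks: list[Chunk] = []
    heading = ""
    body: list[str] = []
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section.
            chunks.extend(_pack(title, heading, "\n".join(body), max_chars))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    chunks.extend(_pack(title, heading, "\n".join(body), max_chars))
    return chunks
```

With this shape, the "deposit is due within 14 days" sentence is embedded together with its "Termination" heading, while `raw_text` stays clean for display. Real property documents would need a structure detector that matches their actual format (PDF outlines, numbered clauses, tables) in place of the heading regex.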

Chunking is where you encode what a "unit of answer" means for your product. Treat it like a default and you've shipped a default product.

Embeddings and the model-change tax

Choosing an embedding model is a one-day decision. Living with it is the expensive part, because the day you want a better model you have to re-embed your entire corpus, and the embeddings are not comparable across models. You cannot mix vectors from two models in one index and expect distances to mean anything.

This is the model-change tax, and it is real every time: a new model ships, or you switch from a hosted API to a local model for cost or privacy, and now every chunk in production has to be recomputed. For a live corpus that is still taking writes, you cannot stop the world to do it.

This is exactly why I run ingestion and embedding as background jobs rather than inline on the request path. On my platform, embedding runs as a Celery task on the Python workers, not a synchronous call inside the API handler. That separation is what makes a re-embedding migration survivable:

  • Write new vectors into a new column or a new index alongside the live one, computed by background workers, so reads keep serving the old embeddings until the backfill is complete.
  • Make the embedding job idempotent and resumable. A backfill over a real corpus will get interrupted, and you need to restart it without recomputing what's done or double-writing.
  • Treat queue pressure as a first-class signal. A re-embedding backfill competes with live ingestion for the same workers, so either isolate it on its own queue or rate-limit it, or you'll starve real-time writes behind a migration.
  • Cut over reads atomically once the new index is fully populated, then drop the old one.
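
A rough sketch of the idempotent, resumable part, using an in-memory SQLite table as a stand-in for Postgres so it runs anywhere. The schema (`chunks`, `embedding_v1`, `embedding_v2`), the `fake_embed` function, and the `max_batches` knob are all assumptions for illustration. In production the loop body would be the background task and the vectors would live in a pgvector column.

```python
import json
import sqlite3
from typing import Callable, Optional

Embedder = Callable[[str], list[float]]


def backfill(conn: sqlite3.Connection, embed: Embedder,
             batch_size: int = 100, max_batches: Optional[int] = None) -> int:
    """Fill embedding_v2 for rows that lack it. Safe to interrupt and re-run."""
    written = 0
    batches = 0
    while max_batches is None or batches < max_batches:
        # Progress lives in the table itself: unfinished rows are the NULL ones.
        rows = conn.execute(
            "SELECT id, embed_text FROM chunks "
            "WHERE embedding_v2 IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        for chunk_id, text in rows:
            # The IS NULL guard makes the write idempotent: no double-writes.
            cur = conn.execute(
                "UPDATE chunks SET embedding_v2 = ? "
                "WHERE id = ? AND embedding_v2 IS NULL",
                (json.dumps(embed(text)), chunk_id),
            )
            written += cur.rowcount
        conn.commit()
        batches += 1
    return written


def ready_to_cut_over(conn: sqlite3.Connection) -> bool:
    """Reads may switch to embedding_v2 only when no row is missing it."""
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE embedding_v2 IS NULL"
    ).fetchone()
    return missing == 0


def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding call."""
    return [float(len(text)), 1.0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, embed_text TEXT NOT NULL, "
    "embedding_v1 TEXT, embedding_v2 TEXT)"
)
conn.executemany(
    "INSERT INTO chunks (embed_text, embedding_v1) VALUES (?, ?)",
    [(f"chunk {i}", "[0.0]") for i in range(5)],
)

first = backfill(conn, fake_embed, batch_size=2, max_batches=1)  # interrupted run
second = backfill(conn, fake_embed, batch_size=2)                # resumed run
third = backfill(conn, fake_embed, batch_size=2)                 # re-run is a no-op
```

Because the old column is never touched, reads keep serving `embedding_v1` for the whole backfill, and rows written by live ingestion during the migration are simply picked up by the next pass.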

If embedding lives inline in your request handler, none of this is available to you, and the first model upgrade becomes a rewrite.

Why I run RAG on pgvector instead of a dedicated vector DB

Across these systems the vector store is Postgres with pgvector, not a dedicated vector database. That is a deliberate choice, and for most products I'd make it again.

The reason is that retrieval is almost never pure vector search. You want the nearest chunks where the tenant is this user, the document type is a lease, the language is Hebrew, and the record isn't archived. In Postgres that's one query: a vector distance ordering with ordinary WHERE clauses and joins, inside a transaction, against data that already lives in the same database as the rest of your application. With a separate vector DB you are now running two stores, keeping them in sync, and reimplementing your filters and joins on the wrong side of a network hop.
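
To make "one query" concrete, here is a sketch of what such a statement could look like, wrapped in a Python helper that only builds the SQL and its parameters. The `chunks` and `documents` tables and all their columns are hypothetical. The sketch assumes psycopg-style named placeholders and that the query vector is passed as a text literal cast to `vector`. `<=>` is pgvector's cosine-distance operator.

```python
FILTERED_SEARCH = """
SELECT c.id, c.raw_text, c.embedding <=> %(query)s::vector AS distance
FROM chunks AS c
JOIN documents AS d ON d.id = c.document_id
WHERE d.tenant_id = %(tenant_id)s
  AND d.doc_type = %(doc_type)s
  AND d.language = %(language)s
  AND d.archived_at IS NULL
ORDER BY c.embedding <=> %(query)s::vector
LIMIT %(limit)s
"""


def to_vector_literal(values: list[float]) -> str:
    """Render a Python list in pgvector's text form, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_search(query_embedding: list[float], tenant_id: int,
                 doc_type: str, language: str, limit: int = 5):
    """Return (sql, params) for a tenant-scoped, metadata-filtered vector search."""
    params = {
        "query": to_vector_literal(query_embedding),
        "tenant_id": tenant_id,
        "doc_type": doc_type,
        "language": language,
        "limit": limit,
    }
    return FILTERED_SEARCH, params


sql, params = build_search([0.1, 0.2], tenant_id=7, doc_type="lease", language="he")
# With a driver such as psycopg this would run as: cursor.execute(sql, params)
```

One caveat worth checking against your pgvector version: with an approximate index, the filters are applied after the index scan, so a very selective filter can return fewer rows than the `LIMIT` asks for.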

The honest limits:

  • At very large scale, dedicated engines win on raw ANN performance and index build time. pgvector with HNSW is good, but it is not the frontier of vector-search performance and you will feel that past a certain corpus size.
  • Index builds and recall tuning are real operational work. HNSW parameters trade build time and memory against recall, and the right setting depends on your corpus.
  • You are sharing a database with your transactional workload, so a heavy index build or a large backfill competes with application traffic for the same resources.

None of the systems I run is at the scale where those limits bite harder than the cost of operating a second database. For a solo operator, one Postgres I already back up, monitor, and run migrations against beats two stores I have to keep consistent. You probably don't need a separate vector database, and you should make someone justify the second system before you adopt it.

Retrieval quality is the whole game

Everything upstream exists to make this one call good, and the failure modes are rarely "the model is bad." They're "we retrieved the wrong chunks."

Pure vector similarity misses exact terms. A user searches a specific clause number, a property reference, a drug name, and the embedding returns something semantically close but not the literal match they asked for. Hybrid retrieval fixes this: combine vector similarity with keyword or full-text search so exact terms get their weight back. Postgres gives you full-text search in the same engine as the vectors, which is another reason the single-store choice pays off.
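
One common way to combine the two result lists is reciprocal rank fusion (RRF). I'm naming it as an illustration, not as the fusion method any of these systems is documented to use. The sketch assumes you already have two ranked lists of chunk IDs, one from vector search and one from full-text search. The IDs and the constant `k = 60` are conventional placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["c7", "c2", "c9"]   # semantically close chunks
keyword_hits = ["c4", "c7"]        # "c4" contains the literal clause number
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `c7` ranks first because both retrievers found it, and `c4`, which vector search missed entirely, still lands second on the strength of its exact match. RRF only uses ranks, so it avoids having to put cosine distances and full-text scores on a common scale.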

Metadata filtering matters as much as ranking. The fastest way to wreck precision is to search the whole corpus when the answer can only live in one document type or one tenant's data. Filter hard before you rank, and your top results get dramatically better for free.

The cross-lingual failures are the ones monolingual demos never surface. When the source documents are in one language and users ask in another, an embedding model without real cross-lingual alignment places the query and the documents in different regions of the vector space, and nearest-neighbor search quietly returns nothing relevant. None of this shows up in an English-only demo. What it forces:

  • Embed with a genuinely multilingual model, and verify cross-language retrieval directly rather than assuming the model handles it.
  • Decide deliberately whether you retrieve in the source language and translate, or translate and then retrieve. The two give different results and different failure modes.
  • Watch RTL and script-specific tokenization. Hebrew and right-to-left text expose chunking and normalization bugs that never appear in Latin-script content.
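
As an example of the normalization point, the sketch below cleans Hebrew text before chunking and keyword matching, using only the standard library. Whether to strip niqqud (vowel points) is an assumption that depends on your corpus and model, which is why it is a flag. The function name `normalize_hebrew` is hypothetical.

```python
import re
import unicodedata

# Invisible direction marks and bidi embedding/isolate controls.
BIDI_CONTROLS = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")


def _is_hebrew_point(ch: str) -> bool:
    """True for combining marks (niqqud, cantillation) in the Hebrew block."""
    return "\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn"


def normalize_hebrew(text: str, strip_points: bool = True) -> str:
    """Normalize text so pointed and unpointed spellings compare equal."""
    # NFKC also expands Hebrew presentation forms into base letter + marks.
    text = unicodedata.normalize("NFKC", text)
    text = BIDI_CONTROLS.sub("", text)
    if strip_points:
        text = "".join(ch for ch in text if not _is_hebrew_point(ch))
    # Collapse whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # "shalom" with niqqud
plain = "\u05e9\u05dc\u05d5\u05dd"                      # same word, unpointed
```

Whatever normalization you choose has to be applied identically to documents at ingestion and to queries at search time, or exact-term matching breaks in ways that look like retrieval bugs.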

Retrieval is only as good as your ingestion. If chunking orphaned the context or embedding flattened the language, no amount of reranking saves you.

Evaluation you can actually run

For a long time, evaluating RAG meant me reading outputs and deciding if they looked right. That does not scale and it does not catch regressions.

I run Langfuse across these systems to trace and evaluate retrieval. What changes is that every retrieval becomes inspectable: for a given question you can see which chunks were retrieved, their scores, and what the model did with them. When an answer is wrong you can tell whether retrieval failed or generation failed, and those are different bugs with different fixes.

What I build on top of that:

  • A regression set of real questions with known-good answers, drawn from actual usage rather than invented. Run it on every meaningful change to chunking, the embedding model, or retrieval parameters.
  • Tracing on the retrieval step specifically, so a quality drop shows up as "we stopped retrieving the right chunk" rather than a vague "answers got worse."
  • A bias toward catching regressions before users do. A change that improves one class of question often quietly breaks another, and without a regression set you find out from a complaint.
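
A regression set does not need much machinery. The sketch below is tool-agnostic and deliberately does not use any tracing library's API. It assumes each case records the chunk IDs that must appear in the top `k` results, and that `retrieve` is whatever function wraps your retrieval call. `Case`, `run_regression`, and the canned results are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    expected_chunk_ids: set[str]  # any one of these in the top k counts as a hit


def run_regression(cases: list[Case],
                   retrieve: Callable[[str], list[str]],
                   k: int = 5) -> tuple[float, list[Case]]:
    """Return the hit rate at k and the cases whose expected chunk was missed."""
    misses: list[Case] = []
    for case in cases:
        top_k = set(retrieve(case.question)[:k])
        if not top_k & case.expected_chunk_ids:
            misses.append(case)
    hit_rate = 1.0 - len(misses) / len(cases) if cases else 0.0
    return hit_rate, misses


# Canned results standing in for a real retriever.
canned = {
    "When is the deposit due?": ["c12", "c3"],
    "What is the notice period?": ["c8", "c1"],
}
cases = [
    Case("When is the deposit due?", {"c12"}),
    Case("What is the notice period?", {"c40"}),
]
hit_rate, misses = run_regression(cases, lambda q: canned.get(q, []))
```

The list of missed cases matters more than the single number: it tells you which questions stopped retrieving the right chunk after a change, which is the starting point for reading the trace.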

You cannot improve what you cannot see. Tracing retrieval is the difference between debugging RAG and guessing at it.

The operational tail nobody demos

This is the part that never appears in a tutorial and is most of what running these systems actually involves.

  • Index maintenance. Vector indexes need rebuilds and tuning as the corpus grows, and a build heavy enough to matter competes with live traffic on a shared database.
  • Re-embedding migrations. Covered above, and they recur every time you change models. Background jobs and a parallel-index cutover are what make them routine instead of an outage.
  • Queue pressure. Ingestion, embedding, and backfills share Celery workers. Without isolation or rate limits, a backfill starves real-time ingestion and your freshest data is the slowest to appear.
  • Cost. Embedding a large corpus and re-embedding on every model change is a recurring bill, not a one-time setup cost. It changes which models you can afford to switch to.

Operating these solo is what makes the tail legible. When you own ingestion, embedding, retrieval, the API, and the infra, you can't hand the migration to another team, so you design for it from the start.

A pragmatic production checklist

  • Chunk on document structure, carry parent context into the chunk, and size for your question shape. Never ship the default splitter unexamined.
  • Run embedding as a background job from day one, even if it feels like overkill, so the first model change isn't a rewrite.
  • Make embedding jobs idempotent and resumable, and give backfills their own queue.
  • Start on pgvector. Filter on metadata in the same query as the vector search. Make someone justify a second store before you adopt it.
  • Use hybrid retrieval so exact terms survive, and filter hard before you rank.
  • If you touch more than one language, verify cross-lingual retrieval directly and decide your translate-then-retrieve order on purpose.
  • Trace retrieval (Langfuse or equivalent) and keep a real regression set. Run it on every change to chunking, embeddings, or retrieval.
  • Budget for the operational tail: index rebuilds, re-embedding migrations, queue pressure, and recurring embedding cost.

The prototype proves the idea. The pipeline, the migrations, and the tracing are what make it something you can run.