The vector database I never actually chose
ChromaDB is the right tool for a prototype. Mine shipped to production, dragged a silent retrieval bug along with it, and the fix had been sitting in my stack the whole time.
There is a function in this system that copies an entire database to /tmp so that the database will agree to be read. I wrote it, I shipped it, and for months I did not treat it as the warning it was. This article is about the decision hiding behind that function: ChromaDB as a vector store, picked by default, kept long past the point where it stopped fitting.
The previous article in this series was about five decisions I could defend. This one is about a decision I couldn’t, and what it took to earn the one-sentence answer I should have had from the start. Along the way I found a bug in my retrieval layer that had been there since the first commit, which no error message was ever going to surface.
What the system is
The product is an audio generator I built for a wellness company. A person answers a short questionnaire about what they’re struggling with and how they want to feel, and a personalized guided audio session lands in their inbox: a script written for their situation, read in a professional voice, ten minutes or more of audio.
Under the hood it is an Azure Durable Functions orchestration with three activities in a chain: generate the script, generate the audio through ElevenLabs, send the email. Leads and their statuses live in Postgres. Remember that last detail, because it is the most expensive sentence in this article.
The script generation is where retrieval comes in. Each session has a structure: an induction that relaxes you, a long affirmations section that does the therapeutic work, and an awakening that brings you back. I had a library of 92 reference scripts, split into three collections by section type. For each generation, the system embeds a description of the user’s situation, queries each collection for the closest matching scripts, and feeds them to the language model as style examples:
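def retrieve_similar_scripts(collection, focus_area, challenge_context, top_k):
    # Describe the user's situation in one sentence and embed it.
    # get_embedding is defined elsewhere in the codebase.
    query = f"A script that helps with {focus_area}, the context is {challenge_context}"
    query_embedding = get_embedding(query)
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    # One query was sent, so the first (and only) result list holds the matches.
    return results["metadatas"][0]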
So this is RAG, but not the kind that looks up facts. The retrieved scripts are exemplars. They teach the model what a good induction sounds like for someone dealing with insomnia versus someone preparing for surgery. The generation prompt does the personalization; retrieval supplies the craft.
Two questions I couldn’t answer
There is a test I now apply to every component I ship: explain why it is there, in one sentence, without looking anything up. When I first ran my retrieval layer through that test, it failed twice.
What kind of retrieval is this? Look at the code above: it embeds the query and ranks documents by vector similarity. That is dense retrieval, what most people mean by semantic search. Keyword search is the other family, the BM25-style matching where “insomnia” has to literally appear in the document. There is a third option, hybrid, that runs both and merges the results, usually with a reranker on top. I knew none of these names when I built this. I had assembled dense retrieval by following examples, and I would have undersold my own system when describing it, because you cannot defend a choice between options you cannot name.
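To make "runs both and merges the results" concrete, here is a minimal sketch of one common merge step, reciprocal rank fusion. This is an illustration, not something the system described here does: the document ids are made up, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from a dense and a keyword retriever.
dense_hits = ["sleep_02", "calm_07", "insomnia_01"]
keyword_hits = ["insomnia_01", "sleep_02"]

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and a reranker, if there is one, would then reorder this fused list.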
Why ChromaDB? The honest answer is that every tutorial I learned from used it, so I did too. In the last article I wrote that “one model for everything” is the absence of an architecture decision. A vector database you can’t explain is the same absence, one layer down. ChromaDB was a decision I inherited from the way I learned RAG. It was what the system already worked with, my time to build was limited, and I went along with it.
The answer the system deserved
What I only understood after studying retrieval properly is that ChromaDB was a defensible choice. Just not a permanent one.
ChromaDB is an embedded database, like SQLite for vectors. You pip install it, point it at a folder, and you have similarity search with zero infrastructure. For a prototype with 92 documents that needed to prove the concept worked, that was right. Standing up extra infrastructure to search 92 documents would have been the opposite mistake. Asked on day one of the project, “it is the fastest path to a working prototype” is a fine answer.
The mistake was that the prototype shipped, and the decision never got revisited. And my own codebase was already telling me it didn’t belong in production. I just wasn’t listening:
import os
import shutil

def _ensure_writable_chroma_db() -> str:
    # Azure Functions mounts the deployment read-only, and Chroma's
    # SQLite needs to write its WAL files even to serve reads.
    src = os.path.join(os.path.dirname(__file__), "..", "chroma_db")
    dst = "/tmp/chroma_db"
    # Copy once per worker; later calls reuse the existing copy.
    if not os.path.isdir(dst):
        shutil.copytree(src, dst)
    return dst
I had to write this because the deployment package on Azure Functions is read-only, and Chroma’s underlying SQLite wants write access just to answer queries. So every worker copies the entire database to its own /tmp on first use. The vector store ships inside the deployment artifact like a config file, which also means updating the corpus requires a redeploy. None of this is ChromaDB’s fault. It is what happens when you take a database designed to live on your laptop and staple it to a serverless runtime. When you find yourself copying your database to /tmp so it will agree to be read, the architecture is speaking to you.
Now the expensive sentence from earlier. Postgres was already in the stack. The same function app held a connection to it, saving leads on every purchase. pgvector, the extension that turns Postgres into a vector store, would have been one CREATE EXTENSION vector away: a table with an embedding column, an index, and retrieval becomes a SQL query against a database that already had backups, access control, and a connection pool. Updating the corpus becomes an INSERT instead of a redeploy. The production-grade option required no migration project at all. It was already running, and I was deploying a second database alongside it because the tutorials had handed me one.
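A rough sketch of what that could look like, under stated assumptions: the table and column names are hypothetical, the HNSW index assumes a pgvector version that supports it, and the cursor is any DB-API cursor that uses %s placeholders (psycopg, for example). At 92 rows the index is optional, since a sequential scan is already exact and fast.

```python
# Hypothetical schema: names are illustrative, not taken from the real system.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS reference_scripts (
    id           bigserial PRIMARY KEY,
    section_type text NOT NULL,          -- induction, affirmations, awakening
    content      text NOT NULL,
    embedding    vector(1536) NOT NULL
);

CREATE INDEX IF NOT EXISTS reference_scripts_embedding_idx
    ON reference_scripts USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine distance operator; smaller means more similar.
QUERY_SQL = """
SELECT id, content, embedding <=> %s::vector AS distance
FROM reference_scripts
WHERE section_type = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values):
    """Format a sequence of numbers as pgvector text input, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def retrieve_similar_scripts_pg(cursor, section_type, query_embedding, top_k):
    """Return the top_k closest rows for one section type."""
    literal = to_vector_literal(query_embedding)
    # The vector literal appears twice: once in SELECT, once in ORDER BY.
    cursor.execute(QUERY_SQL, (literal, section_type, literal, top_k))
    return cursor.fetchall()
```

Adding a script to the corpus would then be an ordinary INSERT into this table, with the embedding passed in the same literal format.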
Pinecone is the right answer to a different question. A managed vector service earns its place when you have millions of embeddings, multiple services querying them, and index performance worth paying a vendor to own. At 92 documents sitting next to an existing Postgres, choosing Pinecone would be the ChromaDB mistake with a monthly bill: picking infrastructure by reputation instead of by fit.
The bug I found while writing this
Studying retrieval gave me the vocabulary. Rereading my code with that vocabulary found a bug that had been live since the beginning.
The script that built the vector store embedded all 92 documents with text-embedding-ada-002. The production query path embeds with text-embedding-3-small. Both models output 1536 dimensions, so nothing ever crashed. But vectors from different embedding models do not live in the same space. Comparing them is like comparing coordinates from two different maps that happen to use the same number format.
I fixed it before publishing, and the before and after say everything. I asked the broken store for affirmation scripts about quitting smoking. It returned scripts about lovability, wealth, and anxiety, with similarity distances so tightly bunched that the ranking carried no information; every document sat almost exactly as far from the query as every other, which is what near-orthogonal vectors look like. After re-embedding the corpus with the same model the queries use, the identical query returns the stop-smoking script first, by a wide margin, followed by scripts on freedom from drugs and gambling. The retrieval had been choosing examples at random for its entire life, and the fix was one afternoon and a few cents of embedding calls.
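The "tightly bunched" symptom can be reproduced with a toy simulation. The sketch below uses random unit vectors as stand-ins for embeddings, on the assumption that a query from an unrelated embedding space behaves like an arbitrary direction relative to the corpus. Real embeddings are not isotropic random vectors, so treat this as the idealized case and not a measurement of the actual store.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing independent Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Cosine distance for unit vectors: 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
DIM, N_DOCS = 1536, 92

# Stand-ins: a corpus from one "model", a query from an unrelated one.
corpus = [random_unit_vector(DIM, rng) for _ in range(N_DOCS)]
query = random_unit_vector(DIM, rng)

distances = sorted(cosine_distance(query, doc) for doc in corpus)
spread = distances[-1] - distances[0]
print(f"closest={distances[0]:.3f} farthest={distances[-1]:.3f} spread={spread:.3f}")
```

In 1536 dimensions every distance lands close to 1.0 and the gap between the best and worst match is small, so the top-k is decided by noise. A healthy store should show a clear gap between relevant and irrelevant documents for the same query.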
No one noticed because the failure was silent in exactly the way vector search likes to fail. Every script in a collection was a competent example of its section type, so whatever the query pulled was usable, and the generation model produced something reasonable from it. The system never errored. It just quietly ignored the personalization that retrieval was supposed to provide. pgvector would not have caught this either, by the way. Dimensions matched, queries ran. The thing that catches it is the thing this series keeps arriving at: an evaluation. Five test queries with expected matches would have exposed it in the first week. Without that measurement, what I was calling retrieval was a random example picker that billed me for an embedding call each time.
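A check like that can be very small. The sketch below is one possible shape for it, with made-up queries and script ids. The `retrieve` argument is a hypothetical adapter that takes a query string and k and returns ranked script ids; in the real system it would wrap the embedding call and the collection query.

```python
def hit_rate_at_k(retrieve, test_cases, k=3):
    """Fraction of test cases whose expected script id appears in the top k.

    retrieve(query, k) must return a ranked list of script ids.
    """
    if not test_cases:
        raise ValueError("need at least one test case")
    hits = sum(
        1 for query, expected_id in test_cases
        if expected_id in retrieve(query, k)[:k]
    )
    return hits / len(test_cases)


# Hypothetical cases: a user situation and the script a practitioner would pick.
TEST_CASES = [
    ("quitting smoking", "stop_smoking"),
    ("cannot fall asleep", "insomnia"),
    ("nervous before surgery", "surgery_prep"),
    ("constant worry", "anxiety"),
    ("gambling urges", "gambling_freedom"),
]


def query_blind_retriever(query, k):
    """Stand-in for a broken store: returns the same scripts for any query."""
    return ["lovability", "wealth", "anxiety"][:k]


EXPECTED = dict(TEST_CASES)


def ideal_retriever(query, k):
    """Stand-in for a working store: always ranks the expected script first."""
    return [EXPECTED[query]][:k]


broken_score = hit_rate_at_k(query_blind_retriever, TEST_CASES)
ideal_score = hit_rate_at_k(ideal_retriever, TEST_CASES)
print(f"broken={broken_score:.2f} ideal={ideal_score:.2f}")
```

The number only means something against a threshold chosen in advance. A store that ignores the query still gets the occasional lucky hit, so the useful signal is the gap between the score you observe and the score you expected.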
A mismatch between how documents are processed at index time and how queries are processed at read time is one of the most common silent failures in production retrieval systems, and embedding-model drift is one of its quietest forms. Knowing the failure exists is the difference between catching it in an afternoon and shipping it for months. I had shipped it for months.
I would rather not have published this paragraph. It stays because it is the strongest evidence for the article’s point: the gap between using retrieval and understanding it was real, and it was mine. The bug was not hiding in some obscure corner. It sat in plain sight between two files, waiting for someone who knew enough to ask whether the query and the corpus spoke the same language.
What the next version changes
The migration list is short, which is part of the lesson:
Move the vectors into the existing Postgres with pgvector. One table, three section types as a column instead of three collections, and the embedding model’s name stored alongside the vectors, so the mismatch I just described is visible in the data instead of silent between two files. The /tmp copy ritual goes away, and updating the corpus stops requiring a deploy.
Write the retrieval eval. A dozen test cases mapping user situations to the scripts a practitioner would pick. I ran a two-query version of this by hand while fixing the embedding bug, and that crude check was enough to expose a problem that had survived months of the system working. The real harness turns “I think retrieval works now” into a number, and it is the cheapest insurance this project never bought.
Name the retrieval type in the design notes. Dense retrieval, deliberately. At 92 curated exemplars there is no hybrid search to justify; nobody is searching for exact product codes in a guided audio script. If the corpus grows past a few thousand documents and retrieval quality drops, hybrid with a reranker is the next experiment, and now I can say why.
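For the first item, storing the model name only helps if something reads it. Here is a minimal sketch of a guard that could run before each query, assuming rows come back with the model name attached. The class and function names are hypothetical.

```python
from dataclasses import dataclass


class EmbeddingModelMismatch(RuntimeError):
    """Raised when the query model differs from the model that built the corpus."""


@dataclass(frozen=True)
class StoredVector:
    script_id: str
    embedding_model: str  # e.g. "text-embedding-3-small"
    embedding: tuple


def assert_same_model(rows, query_model):
    """Fail loudly unless every stored row was embedded with query_model.

    An empty corpus also fails, since there is nothing valid to search.
    """
    stored_models = {row.embedding_model for row in rows}
    if stored_models != {query_model}:
        raise EmbeddingModelMismatch(
            f"corpus embedded with {sorted(stored_models)}, "
            f"query uses {query_model!r}"
        )


rows = [StoredVector("stop_smoking", "text-embedding-ada-002", (0.0, 1.0))]
try:
    assert_same_model(rows, "text-embedding-3-small")
except EmbeddingModelMismatch as exc:
    print(f"caught: {exc}")
```

With the check in place, a mismatch like the one described earlier becomes an exception on the first query instead of months of plausible-looking results.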
What I’d tell you to take from this
Every component in your stack should survive one “why” question. Not a paragraph, one sentence. “ChromaDB, because it’s embedded and this was a 92-document prototype” would have been enough. I didn’t have the sentence, and the absence of the sentence pointed at an absence of understanding. The embedding bug is what that absence costs in practice: not a crash, just a system quietly doing less than everyone believes it does.
The closing of the last article said to write down why you chose things at the moment you choose them. I’d now add the harder half: schedule the moment where you re-ask the question. ChromaDB was right on the first commit and wrong by the time the system was deployed, and no part of my process existed to notice the answer had changed. Writing this retrospective is what noticed, and a habit would have been cheaper than a discovery.
Jorge Garcia is a Lead AI Engineer building multi-agent systems in production. This is the second article in a series on the architecture decisions behind real LLM products.

