the part of rag everyone ignores


i spent sunday afternoon watching my chatbot explain tree pruning to someone asking about git branches. i was at my desk, three monitors glowing, coffee gone cold. the user typed "how do i delete a local branch" and my rag system pulled chunks from a gardening pdf. same embeddings. same index. complete nonsense.
you've probably done this. i know i have.
the problem wasn't my vector database. it wasn't even the llm. it was how i'd sliced and tagged my documents. my chunks were random. my metadata was lazy. the embeddings couldn't tell the difference between "branch" the git concept and "branch" the tree part. i'd spent six hours building a pipeline that worked perfectly for the demo data. real data broke it in ten minutes.
why chunk size ruins everything
i used to think 512 tokens was magic. everyone says so. just slice your docs every 512 tokens and call it a day. but here's what actually happens: you get a chunk that starts with "the branching process requires..." and ends with "...pruning schedule for winter growth." the embedding vector becomes this muddy average of two completely different topics. it's not wrong. it's just confused.
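here's a toy version of what fixed-size chunking actually does. the tokenizer is fake (a whitespace split standing in for a real one) and the doc is made up, but the failure is the real one: the middle chunk straddles git and gardening.

```python
# a minimal sketch of fixed-size chunking. whitespace "tokens" stand in
# for a real tokenizer; the document is invented for the demo.

def chunk_fixed(text: str, size: int) -> list[str]:
    """slice the document every `size` tokens, no matter what it cuts through."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

doc = (
    "git branches are cheap pointers to commits. delete one with git branch "
    "-d. pruning fruit trees in winter encourages strong spring growth on each branch."
)

# small chunk size on a small doc: the middle chunk mixes both topics,
# so its embedding would be that muddy average
chunks = chunk_fixed(doc, 12)
for c in chunks:
    print(repr(c))
```

the middle chunk ends up half git, half horticulture. embed that and you get a vector that points at neither.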
my first time trying this was on a monday morning. i'd indexed 200 technical docs overnight. proud, i asked my system "what's the merge procedure?" it returned a paragraph about git merges. also a paragraph about grafting trees. same chunk. i'd split right through a heading.
the problem isn't what you think. it's not the embedding model's fault. it's that you're asking it to represent two ideas as one. the math does what the math does. it averages. and averaging "branch" the verb with "branch" the noun creates garbage.
look, here's what i mean. i tested three chunk sizes on the same document set. 256 tokens gave me 1,200 chunks. 512 gave me 600. 1,024 gave me 300. the 256-token index was slow but accurate. the 1,024-token index was fast but returned junk. middle ground? there is no middle ground. you pick your failure mode.
and the funny part? my coworker ran the same test and got opposite results. his docs were shorter. his "right" size was my "wrong" size.
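one way out of the heading-splitting problem, sketched here with a made-up markdown doc: split on structure first, then on size inside each section, so a chunk never crosses a heading.

```python
import re

def chunk_by_heading(text: str, size: int) -> list[str]:
    """split on markdown headings first, then fixed-size within each
    section. whitespace tokens again stand in for a real tokenizer."""
    # zero-width split: each heading stays attached to its own body
    sections = re.split(r"(?m)^(?=#)", text)
    chunks = []
    for sec in sections:
        tokens = sec.split()
        for i in range(0, len(tokens), size):
            chunks.append(" ".join(tokens[i:i + size]))
    return chunks

doc = (
    "# merging in git\n"
    "run git merge to combine two branches.\n"
    "# grafting trees\n"
    "join a scion to rootstock in early spring.\n"
)

for c in chunk_by_heading(doc, 50):
    print(repr(c))
```

no magic number survives this either. your sections have a size distribution, and that distribution picks your chunk size for you.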
the metadata lie
most tutorials tell you "just add metadata." source, date, author. boom, done. i followed this. i added three tags to every chunk. then i tried to filter by them. "show me only 2024 docs," i typed. my system returned nothing. why? because i'd tagged everything with "2024" as a string, but my filter expected a number.
the first time i tried metadata filtering, i spent three hours debugging. three hours. for a one-line config change. i'd stored "category: tech" but searched for "category: technology." exact match failed. my system wasn't smart enough to know they're the same.
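one way to dodge both failures (the string-vs-number year, the tech-vs-technology mismatch): push every tag through one normalizer at index time, and push every filter through the same one at query time. a sketch with made-up field names and an invented alias table.

```python
# hypothetical normalizer run on both sides of the index: coerce types
# and collapse aliases so write-time tags and query-time filters agree.
CANON = {"tech": "technology", "technology": "technology", "med": "medical"}

def normalize_meta(meta: dict) -> dict:
    out = dict(meta)
    if "year" in out:
        out["year"] = int(out["year"])  # "2024" and 2024 become the same value
    if "category" in out:
        out["category"] = CANON.get(out["category"].lower(), out["category"].lower())
    return out

stored = normalize_meta({"year": "2024", "category": "Tech"})
query = normalize_meta({"year": 2024, "category": "technology"})
print(stored == query)  # both sides went through the same function
```

the point isn't this exact function. it's that tagging and filtering share one code path, so they can't drift apart.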
but the real problem is deeper. even when it works, metadata makes developers lazy. i caught myself thinking "i don't need good chunks, i'll just filter by tag." that's how you get a system that works for one query and fails for everything else. the tags become a crutch.
what actually happens is your metadata schema grows like a weed. start with three fields, end with thirty. "department," "region," "access_level," "is_approved," "review_status." pretty soon you're not building a search system. you're building a database with extra steps.
hybrid search is a headache dressed as a solution
my coworker tried to solve our retrieval problems with hybrid search. keywords plus vectors. best of both worlds, right? he spent two weeks tuning the weights. 70% vector, 30% keyword. then 60/40. then 80/20. every change fixed one query and broke two others.
the promise is simple. vector search gets meaning. keyword search gets exact matches. combine them, you get both. the reality is a tuning nightmare.
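the fusion itself is tiny. a sketch with invented doc ids and scores, where alpha is the knob he spent two weeks turning: normalize each score list, then blend.

```python
# a minimal sketch of weighted hybrid fusion. doc ids and scores are
# made up; real systems often use reciprocal rank fusion instead.

def minmax(scores: dict[str, float]) -> dict[str, float]:
    """rescale scores to [0, 1] so the two systems are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def hybrid(vec: dict, kw: dict, alpha: float) -> list[tuple[str, float]]:
    """alpha = 1.0 is pure vector, alpha = 0.0 is pure keyword."""
    v, k = minmax(vec), minmax(kw)
    docs = set(v) | set(k)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

vec_scores = {"doc_git": 0.9, "doc_tree": 0.8, "doc_faq": 0.2}
kw_scores = {"doc_git": 3.0, "doc_faq": 7.0}
print(hybrid(vec_scores, kw_scores, alpha=0.7))  # nudge alpha and the order flips
```

with alpha at 0.7 the vector favorite wins. drop it to 0.3 and the keyword favorite takes over. that's the whole spreadsheet, in two lines of arithmetic.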
i asked him "how's it going?" he showed me a spreadsheet with 47 test queries. each column was a different weight ratio. green cells meant "good result." red meant "bad." it looked like a christmas tree with measles. no pattern. just chaos.
most people ask if they should use hybrid search. here's the honest answer: only if you have no choice. if your users search for product codes or error numbers, maybe. but if they're asking natural questions, pure vector search is cleaner. simpler. less likely to surprise you.
the embedding model trap
i changed my embedding model last month. upgraded to a newer, better one. that's when i learned about vector drift. the new model generated completely different vectors for the same text. my index had old vectors. new queries used new vectors. they didn't match. my system became a random answer generator.
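one cheap guard against the silent version of this failure, sketched with hypothetical model names: stamp the index with the model that built it, and refuse vectors from any other.

```python
# a hypothetical guard against vector drift: the index remembers which
# model produced its vectors and rejects everything else loudly.

class VectorIndex:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float], model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name}, got {model_name}")
        self.vectors[doc_id] = vector

    def query(self, vector: list[float], model_name: str):
        if model_name != self.model_name:
            # old vectors + new query model = random answers that look plausible
            raise ValueError(f"re-embed first: index is {self.model_name}, "
                             f"query came from {model_name}")
        ...  # nearest-neighbor search would go here

idx = VectorIndex("embedder-v1")
idx.add("doc1", [0.1, 0.2], "embedder-v1")
try:
    idx.query([0.3, 0.4], "embedder-v2")  # the upgrade that bit me
except ValueError as e:
    print(e)
```

a loud error is so much better than a quiet random answer generator.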
re-embedding sounds easy. just run everything again. but i had a million chunks. that takes days. costs money. and what if the next model is better next month? do i keep re-embedding forever?
the first time i tried this, i thought i could be clever. i'd keep both models. query both indexes. merge results. my latency doubled. my bill tripled. and results were still weird.
look, here's what i wish i'd known. pick a model and marry it. don't upgrade just because something new comes out. the improvement is usually small. the migration pain is always huge.
and benchmark properly. i'd been testing on 100 docs. when i scaled to 10,000, performance tanked. the setup that felt snappy on a small index crawled on a large one.
why i now name all my indexes after fish
this is unrelated but it matters. i got tired of calling things "prod_index_v2_final_actual." so i started naming them after fish. salmon, trout, barracuda. it makes standups fun. "how's tuna performing?" "tuna's slow today."
but it also solved a real problem. people remember names better than version numbers. when someone asks which index has the medical docs, i say "sturgeon." not "index_v14_medical_2024_11." the name carries no technical weight. it's just a handle. a way to talk without confusion.
my manager hates it. he wants semver and documentation. i want to remember what i'm talking about. we compromised. the official name is "medical_index_v3." the fish name is in the comments. everybody wins.
most people shouldn't build rag systems
here's the blunt truth. if you have fewer than 10,000 documents, just use keyword search. it's faster. it's cheaper. it works. rag shines when you have messy, unstructured data that keyword search can't handle. most companies don't have that problem. they just think they do.
the cost is real. embeddings aren't free. vector databases aren't free. the time you spend tuning chunk sizes and metadata schemas is time you could spend fixing your actual product.
and the maintenance? brutal. every new document needs embedding. every model update risks breaking everything. it's not set-and-forget. it's a pet that needs constant feeding.
who should bother? if you're building a chatbot that answers questions about 50,000 technical docs, maybe. if you're doing legal research across millions of case files, sure. but if you're just trying to search your company wiki? use postgres full-text search. it's fine.
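to make that concrete without standing up a postgres server, here's the same idea with sqlite's fts5, which ships with most python builds: boring full-text search, no embeddings, no tuning. table and data are invented for the demo.

```python
import sqlite3

# a stand-in for postgres full-text search: sqlite's fts5 module.
# an in-memory table, two fake wiki pages, one keyword query.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE wiki USING fts5(title, body)")
con.executemany(
    "INSERT INTO wiki VALUES (?, ?)",
    [
        ("deleting branches", "use git branch -d to delete a local branch"),
        ("winter pruning", "prune each tree branch before spring growth"),
    ],
)
# "git branch" means both terms must match, so the gardening page stays out
rows = con.execute(
    "SELECT title FROM wiki WHERE wiki MATCH ? ORDER BY rank", ("git branch",)
).fetchall()
print(rows)
```

no vectors, no chunking, no metadata schema. for a company wiki, that's usually the whole job.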
this is overkill for small sites
i keep seeing tutorials that say "every app needs rag." no. your blog doesn't need rag. your portfolio site doesn't need rag. that todo app definitely doesn't need rag.
the hype cycle is annoying. developers feel pressure to use the fancy new thing. but sometimes the boring old thing works better. elasticsearch has been around for a decade. it solves most search problems. it's boring. it works.
i spent a month building a rag system for my personal notes. 300 notes. total. keyword search finds what i need in 50 milliseconds. my rag system took 2 seconds. i turned it off.
i still think about that sunday. should have walked my dog. but at least now when i see tree care tips in my git results, i know exactly which config file to fix. small win. i'll take it.