Search that understands what you meant, running on the laptop


I have about nine years of notes. Markdown files, mostly, with the occasional PDF I told myself I'd read properly later. Grep finds the word, but grep cannot find the idea. I knew I'd written something about backpressure in a message queue, and I knew it wasn't filed under "backpressure", and I spent twenty minutes failing to surface it. That was the afternoon I decided to put a proper semantic index over the lot.

The interesting constraint I set myself: it all runs on the laptop. No API key, no per-token billing, nothing leaving the disk. Partly principle, partly because my notes contain enough half-formed opinions about former colleagues that I'd rather they stayed local.

What an embedding actually buys you

An embedding is just a vector. You take a chunk of text, push it through a model, and get back a fixed-length array of floats, say 384 or 768 of them. The clever bit is that the model has been trained so that text with similar meaning lands close together in that space. "How do I throttle a producer that's outrunning the consumer" and "backpressure on a queue" end up near each other even though they share almost no words.

Search then becomes: embed the query, find the nearest stored vectors by cosine similarity, return the chunks they came from. That's the whole trick. No keyword overlap required.
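To make "nearest by cosine similarity" concrete, here is a toy sketch using hand-made three-dimensional vectors in place of real embeddings. The numbers and note titles are invented for illustration; a real model would hand back hundreds of floats per text.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings"; real ones come out of the model.
stored = {
    "backpressure on a queue": [0.9, 0.1, 0.2],
    "sourdough starter schedule": [0.1, 0.9, 0.3],
    "renewing the TLS certificate": [0.2, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this is "throttle a fast producer"

# Rank stored texts from most to least similar to the query.
ranked = sorted(stored, key=lambda name: cosine(stored[name], query), reverse=True)
print(ranked[0])
```

The query vector points in nearly the same direction as the backpressure vector, so that note ranks first even though the two strings share no words.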

The catch people gloss over is chunking. You can't embed a 4,000 word note as one vector and expect anything useful, because the meaning gets averaged into mush. So you split. I settled on roughly 200 to 300 word chunks with a bit of overlap so a sentence straddling a boundary isn't lost.
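A minimal sketch of word-based chunking with overlap, assuming whitespace-separated words are a good enough unit. The function name chunk_words and the default sizes are illustrative choices, not lifted from the real script.

```python
def chunk_words(text, size=250, overlap=40):
    """Split text into chunks of about `size` words, each sharing
    `overlap` words with the chunk before it."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                  # this window already reached the end
    return chunks
```

It may be worth checking chunk length against the model's max_seq_length, since sentence-transformers models silently truncate anything past that limit (256 word pieces by default for all-MiniLM-L6-v2, going by its model card). A 300 word chunk can exceed that, in which case the tail of the chunk never reaches the vector.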

The model

For on-device work in September 2025 the sentence-transformers family still does the job without drama. I used all-MiniLM-L6-v2. It's small, it's old enough to be boring, and boring is exactly what I want here. 384 dimensions, runs happily on CPU, embeds a chunk in a few milliseconds. There are bigger, better models, but the gap in quality is not worth the gap in latency for a personal note search.


The pipeline, stripped down:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    # normalise so cosine sim is just a dot product
    return model.encode(texts, normalize_embeddings=True)

# load_and_chunk is my own helper (not shown): it walks the folder and
# returns the overlapping chunks as a list of strings
chunks = load_and_chunk("notes/")          # list of strings
vectors = embed(chunks)                     # (N, 384)
np.save("index.npy", vectors)

Normalising the vectors up front matters more than it looks. Once everything is unit length, cosine similarity is just a dot product, and a dot product against a matrix is one vectors @ q call. For my corpus, a few thousand chunks, that's a single matrix multiply that finishes before I've let go of the enter key. I genuinely did not need a vector database. A NumPy array and argsort is enough, and I'd encourage anyone with fewer than, say, a hundred thousand chunks to start there before reaching for Faiss or Qdrant.
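The claim that normalised vectors turn cosine similarity into a plain dot product is easy to check. This sketch uses random arrays as stand-ins for real embeddings (an assumption for the demo, nothing here touches the model) and compares the full cosine formula with the normalise-once shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 384))   # stand-in for un-normalised embeddings
q_raw = rng.normal(size=384)         # stand-in for an un-normalised query

# Full cosine formula: dot product over the product of the lengths.
full = (raw @ q_raw) / (np.linalg.norm(raw, axis=1) * np.linalg.norm(q_raw))

# Normalise once, then a matrix-vector product gives the same scores.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)
q = q_raw / np.linalg.norm(q_raw)
fast = unit @ q

print(np.allclose(full, fast))
```

The division by the norms happens once at index time instead of on every query, which is why the search itself reduces to one multiply.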

Querying

def search(query, k=5):
    q = embed([query])[0]
    scores = vectors @ q                    # cosine, since all unit length
    top = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

That's it. The first time it returned the backpressure note from a query that didn't contain a single matching keyword, I sat back and grinned like an idiot. It works. It actually works.

A small refinement that paid off: I prepend a one-line title or filename to each chunk before embedding it. The model then has a bit of context about where the chunk came from, and "untitled paragraph three" stops competing with the paragraph that's actually about the thing. It's a cheap trick and it noticeably tightened the top results. I also tried adding a short instruction prefix to queries, the sort of "represent this sentence for retrieval" prompt some newer models want, but MiniLM wasn't trained for it and it made no measurable difference, so I dropped it. Worth knowing which models care and which don't before you cargo-cult a prefix from a blog post.
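The title trick amounts to a few lines. A sketch, assuming the filename is a reasonable stand-in for a title; the helper names with_title and label_chunks are hypothetical.

```python
from pathlib import Path

def with_title(path, chunk):
    """Prefix a chunk with a readable title derived from its filename."""
    title = Path(path).stem.replace("-", " ").replace("_", " ")
    return f"{title}\n{chunk}"

def label_chunks(path, chunks):
    """Apply the same title prefix to every chunk from one file."""
    return [with_title(path, c) for c in chunks]

print(with_title("notes/message-queue_notes.md", "Slow consumers need a signal."))
```

One design choice to be aware of: if the prefixed strings are what gets stored, the title also shows up in the search results. Keeping the original chunk next to the prefixed one avoids that, at the cost of a second list.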

The other thing I'll flag, because it surprised me, is how forgiving the whole approach is to messy input. My notes are riddled with code fragments, half-finished sentences, and the occasional shopping list that wandered in. I expected those to pollute the index. In practice they just sit in their own quiet corner of the vector space and never surface unless I search for something genuinely similar. The model doesn't need clean prose to be useful, which is a relief, because clean prose is not what nine years of notes contains.

The bits that bit me

A few things weren't obvious from the outside.

Scores are not probabilities. A cosine of 0.42 doesn't mean "42% relevant". You learn the rough threshold for your own corpus by eyeballing results, and for me anything under about 0.3 was noise.
Embeddings go stale when you edit. I add new notes daily, so I keep a small manifest of file hashes and only re-embed what changed. Re-embedding the whole lot takes under a minute anyway, but the incremental path keeps it instant.
Semantic search is rubbish at exact identifiers. If I search for a specific error code or a hostname, the dumb old grep wins every time. So I kept both. The new tool sits alongside grep, it doesn't replace it.
The model's idea of "similar" is not always yours. Occasionally it confidently surfaces something that's topically adjacent but useless, two notes that both mention deployment, say, with nothing else in common. You learn to skim past these quickly, but it's a reminder that the vector space encodes the model's priors, not your intent.
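The hash manifest mentioned above could look something like this. It is a sketch under assumptions: the manifest is a JSON file mapping paths to SHA-256 digests, only .md files are tracked (PDFs would need their own handling), and the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def file_hash(path):
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def changed_files(notes_dir, manifest_path):
    """Compare the notes folder with the saved manifest.

    Returns (changed, removed, new_manifest): paths that are new or
    edited, paths that have disappeared, and the up-to-date manifest."""
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    new = {str(p): file_hash(p) for p in sorted(Path(notes_dir).rglob("*.md"))}
    changed = [p for p, digest in new.items() if old.get(p) != digest]
    removed = [p for p in old if p not in new]
    return changed, removed, new

def save_manifest(manifest_path, manifest):
    """Write the manifest back once the index has been updated."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

Only the files in changed need re-chunking and re-embedding; the rows belonging to removed files have to be dropped from the index as well, which means remembering which file each chunk came from.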

On performance, since people always ask: embedding the full corpus from cold takes well under a minute on a laptop CPU, no GPU involved. Memory is trivial: a few thousand 384-dimension float32 vectors come to single-digit megabytes (5,000 of them is 5,000 × 384 × 4 bytes, about 7.7 MB). The model itself is about 90 MB on disk and loads in a couple of seconds. None of this is straining the hardware, which is the whole appeal of going small. I could run a far larger embedding model and squeeze out better recall, but then I'd be waiting, and a search tool you wait for is a search tool you stop using. The MiniLM-sized sweet spot exists precisely because instant-and-good-enough beats slow-and-excellent for something you reach for fifty times a day.
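The memory figure is simple arithmetic, and it is worth running for your own corpus size before deciding whether a plain array will do. A quick sketch; the chunk counts are example values, not measurements.

```python
def index_size_mb(num_chunks, dims=384, bytes_per_float=4):
    """Size of a dense float32 index in megabytes (10**6 bytes)."""
    return num_chunks * dims * bytes_per_float / 1e6

# Example corpus sizes, from small to the point where a plain
# array starts to feel large.
for n in (1_000, 5_000, 100_000):
    print(f"{n:>7} chunks: {index_size_mb(n):6.1f} MB")
```

Even at a hundred thousand chunks the index is on the order of 150 MB, which still fits comfortably in a laptop's RAM.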

The honest conclusion is the one about grep. Embeddings don't make keyword search obsolete, they cover the case keyword search was always bad at: when you remember the shape of a thing but not its words. I now have a fifty-line script that does that, runs entirely on my own hardware, and has already paid for the afternoon I spent building it. The robot stays on the laptop, which is exactly where I want it.