on-device embeddings, or how i stopped paying per search

A small robot, vaguely menacing

I wanted semantic search over my notes, and I did not want every keystroke to become an API call to someone's billing meter. The obvious route in 2024 is to throw your text at a hosted embedding endpoint and store the vectors. It works, it is cheap per call, and it is also a small ongoing tax on a thing that should cost nothing once built. So I tried doing the whole lot on-device, and it turns out that is now entirely reasonable.

The vault is a few thousand Markdown files. The job is: turn each chunk into a vector, store the vectors, and at query time turn the query into a vector and find the nearest neighbours. None of that needs a datacentre.

the model

The embedding model is the only interesting choice. You want something small enough to run on a laptop CPU in reasonable time but good enough that the nearest-neighbour results actually mean something. A bge-small style model in the 384-dimension range is the sweet spot right now. It loads in a second, runs happily without a GPU, and the quality is genuinely fine for personal-scale retrieval.

A circuit board, because embeddings live somewhere

chunking is where the quality actually comes from

Everyone obsesses over the model and ignores the chunking, which is backwards. A whole note is too coarse: a query about one paragraph drags in the vector for nine unrelated ones. A single sentence is too fine and loses context. I settled on splitting on headings and then on paragraph boundaries, capping each chunk at a few hundred tokens, with a sentence or two of overlap so a thought that straddles a boundary is not cut in half.

def chunk(text, max_tokens=256, overlap=1):
    blocks = split_on_headings_and_paras(text)
    for block in blocks:
        for window in sliding_windows(block, max_tokens, overlap):
            yield window

storing the vectors

For a few thousand chunks you do not need a vector database. You need a file. I keep the vectors in a single array on disk and do a brute-force cosine search at query time. Even unoptimised, scanning a few thousand 384-dimension vectors is faster than the round trip to any hosted service would have been, and it never leaves the machine.

import numpy as np

def search(query_vec, vectors, k=8):
    sims = vectors @ query_vec  # vectors pre-normalised
    return np.argpartition(-sims, k)[:k]

The result is search that understands "how do I set up kdump" maps to a note titled "capturing kernel crash dumps", with no shared keywords at all. The whole thing runs offline, costs nothing per query, and the data stays mine. For personal-scale retrieval the case for sending it all to an API has quietly evaporated, and I am glad of it.