API constant is text-embedding-3
I don’t understand how the math works out on those vectors
Not an expert, but I believe now that we can fit more tokens into an LLM's context window, we can avoid a number of problems by providing additional context around any chunk of text that might be useful to the LLM. Solves the problem of misinterpretation of the important bit by the LLM.
The benefit is that each sentence’s embedding is informed by all of the other sentences in the context. So when a sentence refers to “The company” for example, the sentence embedding will have captured which company that is based on the other sentences in the context.
This technique is called ‘late chunking’ [1], and is based on another technique called ‘late interaction’ [2].
And you can combine late chunking (to pool token embeddings) with semantic chunking (to partition the document) for even better retrieval results. For an example implementation that applies both techniques, check out RAGLite [3].
[1] https://weaviate.io/blog/late-chunking
[2] https://jina.ai/news/what-is-colbert-and-late-interaction-an...
My best guess so far is that somehow I embed a long text and then I break up the returned embedding into multiple parts and search each separately? But that doesn't sound right.
A transformer-based embedding model doesn’t just give you a vector for the entire input string; it gives you a vector for each token. These are then “pooled” together (e.g. averaged, max-pooled, or combined with some other strategy) to reduce the many vectors down to a single vector.
Late chunking means changing this reduction to yield many vectors instead of just one.
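To make the difference concrete, here is a minimal sketch in plain Python. The token vectors are made-up toy numbers standing in for a model's per-token output, and `mean_pool` and `late_chunk` are hypothetical helper names, not part of any library. Mean pooling is just one of the strategies mentioned above.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors into a single vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]


def late_chunk(token_vectors: List[Vector], spans: List[Tuple[int, int]]) -> List[Vector]:
    """Pool the token vectors of each (start, end) span separately (end exclusive)."""
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy per-token embeddings for a 4-token input (assumed values, 2 dimensions).
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

# Conventional pooling: the whole input collapses to one vector.
single = mean_pool(token_vectors)

# Late chunking: same token vectors, but one pooled vector per span
# (here two "sentences" of two tokens each).
per_sentence = late_chunk(token_vectors, [(0, 2), (2, 4)])
```

The key point is that the token vectors were computed over the whole input before being split, so each span's vector still reflects the surrounding context.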
Then, you want to partition the document into chunks. Late chunking pairs really well with semantic chunking because semantic chunking can use late chunking's improved sentence embeddings to find semantically more cohesive chunks. In fact, you can cast this as a binary integer programming problem and find the ‘best’ chunks this way. See RAGLite [1] for an implementation of both techniques including the formulation of semantic chunking as an optimization problem.
Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT's MaxSim [2].
[1] https://github.com/superlinear-ai/raglite
[2] https://huggingface.co/blog/fsommers/document-similarity-col...
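For the multi-vector option, a rough sketch of MaxSim-style scoring in plain Python: for each query vector, take its best cosine similarity against any vector in the chunk, then sum those maxima. The function names and the toy vectors are assumptions for illustration. ColBERT applies this to token embeddings; here it is applied to sentence embeddings.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_sim(query_vectors: List[Vector], chunk_vectors: List[Vector]) -> float:
    """Sum, over query vectors, of the best similarity to any chunk vector."""
    return sum(
        max(cosine(q, c) for c in chunk_vectors)
        for q in query_vectors
    )


# Toy multi-vector query and two candidate chunks (assumed values).
query = [[1.0, 0.0], [0.0, 1.0]]
chunk_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query vectors
chunk_b = [[1.0, 0.0]]              # covers only the first one

score_a = max_sim(query, chunk_a)
score_b = max_sim(query, chunk_b)
```

Here `chunk_a` scores higher than `chunk_b` because every part of the query finds a close match in it, which is the behavior you lose when everything is pooled into a single vector per chunk.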
Does it somehow capture _all_ of the ideas, and querying for a single one would somehow match?
Isn't that the point of breaking down into sentences?
Someone mentioned adding context -- but doesn't it calculate embedding on the whole thing? The API Docs list `input` but no separate `context`. https://docs.voyageai.com/reference/embeddings-api
I'm also quite interested in the nuts and bolts: does anyone know what the current accepted leaderboard on this is? I was screwing around with GritLM [1] a few months back and I seem to remember the MTEB [2] was kind of the headline thing at that time, but I might be out of date.
[1] https://arxiv.org/pdf/2402.09906
[2] https://huggingface.co/blog/mteb
so the same link has been posted ~10 times in the last month?
and this is the first time the post got any attention
mixed feelings there
This way I can tune what aspects of the doc I want to focus retrieval on, it's easier to determine when there are any data quality issues that need to be fixed, and the summaries have turned out to be useful for other use cases in the company.
I was critical about these guys before (not about their quality of work but rather about building a business around embeddings). This work though seems interesting and I might even give it a try, esp if they provide a fine-tuning API (is that on the roadmap?)
One year ago simonw said this in a post about embeddings:
[https://news.ycombinator.com/item?id=37985489]
> Lots of startups are launching new “vector databases”—which are effectively databases that are custom built to answer nearest-neighbour queries against vectors as quickly as possible.
> I’m not convinced you need an entirely new database for this: I’m more excited about adding custom indexes to existing databases. For example, SQLite has sqlite-vss and PostgreSQL has pgvector.
Do we still feel specialized vector databases are an overkill?
We have AWS promoting Amazon OpenSearch as the default vector database for a RAG knowledge base and that service is not cheap.
Also I would like to understand a bit more about how to pre-process and chunk the data properly in a way that optimizes the vector embeddings, storage and retrieval ... any good guides on this that I can refer to? Thanks!
1. It depends on how many embeddings we are talking about. A few million, probably yes. In the hundreds of millions to billions range? You likely need something custom.
2. Vectors are only one way to search for things. If your corpus contains stuff that doesn't carry semantic weight (think about part numbers) and you want to find the chunk that contains that information, you'll likely need something that uses tf-idf.
3. Regarding chunk size, it really depends on your data and the queries your users will do. The denser the content, the smaller the chunk size.
4. Preprocessing - again, it depends. If it's PDFs with just text, try to remove footers / headers from the extracted text. If it contains tables, look at something like TableFormer to extract a good HTML representation. Clean up other artifacts from the text (like dashes for line breaking, square brackets with reference numbers for scientific papers, ... ).
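As a rough illustration of the artifact cleanup in point 4, here is a small sketch using only Python's `re` module. The function name `clean_extracted_text` and the exact patterns are assumptions; real PDFs will need patterns tuned to the data.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Remove a few common PDF-extraction artifacts from plain text."""
    # Rejoin words split by a line-break hyphen: "chunk-\ning" -> "chunking".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop numeric citation markers such as [12] or [3, 4], plus leading space.
    text = re.sub(r"\s*\[\d+(?:\s*[,-]\s*\d+)*\]", "", text)
    # Collapse runs of spaces and tabs left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "Late chunk-\ning works well [1, 2]."
cleaned = clean_extracted_text(raw)
```

One caveat: the de-hyphenation step will also merge genuine compounds that happen to break at a line end (e.g. "state-\nof" becomes "stateof"), so it is worth spot-checking the output on your own documents.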
My impression is that it might be best to do vector index construction separately from the rest of the data, for performance reasons. It seems vector indexes are several orders of magnitude more compute intensive than most other database operations.
For example I’m using duckDB as a vector store for similarity search and RAG. It works really well.
As mentioned by another comment, an advantage of using a separate vector store (on different hardware) is that (re-)building vector indices can cause high CPU load and therefore latency for regular queries to go up.
RAGs are the ControlNet of image diffusion. They exist for many reasons, some of those are that context windows are small, instruct-style frontier models haven’t been adequately trained on search tasks, and reason #1: people say they need RAGs so an industry sprouts up to give it to them.
Do we need RAGs? I guess for now yes, but in the near future no: 2/3 reasons will be solved by improvements to frontier models that are eminently doable and probably underway already. So let’s answer the question for controlnets instead to illuminate why just because someone asks for something, doesn’t mean it makes any sense.
If you’re Marc Andreessen and you call Mike Ovitz, your conversation about AI art generation is going to go like this: “Hollywood people tell me that they don’t want the AI to make creative decisions, they want AI for VFX or to make short TikTok videos or” something something, “the most important people say they want tools that do not obsolete them.” This trickles down to the lowly art director, who may have an art illustration background but who is already sending stuff overseas to be done in something that resembles functionally a dehumanized art generator. Everybody up and down this value chain has no math or English Lit background so to them, the simplest, most visual UX that doesn’t threaten their livelihood is what they want: Sketch To Image.
Does Sketch to image make sense? No. I mean it makes sense for people who cannot be fucked to do the absolutely minimal amount of lift to write prompts, which is many art professionals who, for the worse, have adopted “I don’t write” as an identity, not merely some technical skill specialization. But once you overcome this incredibly small obstacle of writing 25 words to Ideogram instead of 3 words to Stable Diffusion, it’s obvious: nobody needs to draw something and then have a computer finish it. Of course it’s technologically and scientifically tractable to have all the benefits of controlnets like, well control and consistency, but with ordinary text. But people who buy software want something that is finished, they are not waiting around for R&D projects. They want some other penniless creative to make a viral video using Ideogram or they want their investor’s daughter’s heir boyfriend who is their boss to shove it down their throats.
This is all meant to illustrate that you should not be asking people who don’t know anything what technology they want. They absolutely positively will say “faster horses.” RAGs are faster horses!