There are probably some interesting ideas around tokenization and metadata as well. For example, if you're processing the raw file, I expect you want to strip out a lot of markup before tokenizing the content. Conversely, some markup, like code blocks or examples, would be meaningful for tokenization and embedding anyway.
I wonder if both of those ideas can be combined for something like automated footnotes and annotations: linking to, or showing on mouseover, relevant content from elsewhere in the documentation.
But I don't see them when I filter the list for 'voyage'.
It is worth noting that their own published material [0] does not include any scores from any dataset in the MTEB benchmark.
This may sound nitpicky, but considering transformers' parroting capabilities, having seen test data during training should be expected to completely invalidate those scores.
[0] see excel spreadsheet linked here https://blog.voyageai.com/2024/09/18/voyage-3/
Could hurt performance in niche applications, in my estimation.
Looking forward to trying the announced large models though.
Many of the top-performing models that you see on the MTEB retrieval leaderboards for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also pretty small in size compared to a lot of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.
Why should I pick voyage-3 if for all I know it sucks when it comes to retrieval accuracy (my personally most important metric)?
Nice!
Fortunately MTEB lets you sort by model parameter size because using 7B parameter LLMs for embeddings is just... Yuck.
It definitely also seems like there should be lots of ways to utilize "Cosine Similarity" (or other closeness algos) in databases and other information processing apps that we haven't really exploited yet. For example, you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but vector similarity between resume and job description. That's probably so obvious it's being done already.
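For anyone curious, a minimal sketch of that idea with sentence-transformers (the model name and texts are placeholders, not a production recipe):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

    job_description = "Senior Java web developer, Spring Boot, REST APIs"
    resumes = [
        "10 years of Java web development, Spring, microservices",
        "Frontend engineer, React and TypeScript",
    ]

    # Embed the job description and each resume, then rank resumes by cosine similarity
    job_vec = model.encode(job_description, normalize_embeddings=True)
    resume_vecs = model.encode(resumes, normalize_embeddings=True)
    scores = util.cos_sim(job_vec, resume_vecs)[0]
    for resume, score in sorted(zip(resumes, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {resume}")

Similarity alone is of course a crude proxy for candidate quality, but it's enough to see the mechanic.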
I really like the picture you are drawing with "semantic hashes"!
The point of the translating model in between would be that it would re-weight each and every one of the values of the embedding, after being trained on a massive dataset of original text -> vector embedding for model A + vector embedding for model B. If you had billions of parameters trained to do this translation between just two specific models to start with, wouldn't this be in the realm of the possible?
For example, the number of different unit vector combinations in a 1500-dimensional space is like the number of different ways of "ordering" the components, which is 1500! (roughly 5 × 10^4114).
EDIT: And the point of that factorial is that even if the dimensions were "identical" across two different LLMs but merely "scrambled" (in ordering) there would be that large number to contend with to "unscramble".
The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.
For example, say a job requires A and B.
Candidate 1 is a junior who has done some work with A, B and C.
Candidate 2 is a senior and knows A, B, C, D, E and F by heart. All are relevant to the job and would make 2 the optimal candidate, even though C–F are not explicitly stated in the job requirements.
Candidate 1 would seem a much better candidate than 2, because 1’s embedding vector is closer to the job embedding vector.
We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding things to their skill list that they've hardly used, just because they're buzzwords.
So Candidate 1 could still blow them out of the water in performance, and might even be able to trivially learn D and E in a short while on the job if needed.
The skill vector won't tell you much by itself, and may even prevent finding the better candidate if it's used for screening.
That is indeed a problem. I have been thinking about a possible solution to the very same problem for a while.
The fact: people lie on their resumes, and they do it for different reasons. There are white lies (e.g. pumping something up because they aspire to it but were never presented with an opportunity to do it, yet they are eager to skill up, learn, and do it if given the chance). Then there are other lies. Generally speaking, lies are never black or white, true or false; they are a shade of grey.
So the best idea I have been able to come up with so far is a hybrid solution that entails the text embeddings (the skills similarity match and search) coupled with the sentiment analysis (to score the sincerity of the information stated on a resume) to gain an extra insight into the candidate's intentions. Granted, the sentiment analysis is an ethically murky area…
But as for detecting lies in sentences, that simply cannot be done; even if it ever did work, the failure rate would still be on the order of 99%, so you're better off flipping a coin.
People do this in ecommerce with the concept of user embeddings and product embeddings, where the result of personalized recommendations is just a user embedding search.
That statement is just flat-out incorrect on its face; however, it did make me think of something I hadn't thought of before, which is this:
Embedding vectors can be given a "scale" (multiplier) on specific terms that represents the amount of "weight" to add to that term. For example, if I have 10 years of experience in Java Web Development, then we can take the actual components of that vector embedding (i.e. for the string "Java Web Development") and multiply them by some proportion of 10, and that results in a vector that is "further" along that direction. This represents an "amount" of pull in the Java Web direction.
So this means even with vector embeddings we can scale out to specific amounts of experience. Now here's the cool part. You can then take all THOSE scaled vectors (one for each individual job candidate skill) and average them to get a single point in space which CAN be compared as a single scalar distance from what the Job Requirements specify.
This is geometrically "meaningful", semantically. It would apply to not just a time vector (experience) but in other contexts it could mean other things. Like for example, money invested into a particular sector (Hedge fund apps).
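A rough numpy sketch of that weighting idea (the scaling scheme and skill list are just one possible choice, not a claim about what works in practice):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Candidate skills with years of experience as weights
    skills = {"Java web development": 10, "SQL": 4, "Kubernetes": 1}
    job_text = "Java web developer with strong SQL skills"

    skill_vecs = model.encode(list(skills.keys()), normalize_embeddings=True)
    weights = np.array(list(skills.values()), dtype=float)

    # Scale each skill vector by experience, average into one candidate vector, renormalize
    candidate_vec = (skill_vecs * weights[:, None]).mean(axis=0)
    candidate_vec /= np.linalg.norm(candidate_vec)

    job_vec = model.encode(job_text, normalize_embeddings=True)
    print("candidate-to-job cosine similarity:", float(candidate_vec @ job_vec))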
This makes me realize we could design a new type of Perceptron (MLP) where specific scalars for particular things (money, time, etc.) could be wired into the actual NN architecture, in such a way that a specific input "neuron" would be fed a scalar for time, and a different neuron a scalar for money, etc. You'd have to "prefilter" each training input to generate the individual scalars, but then input them into the same "neuron" every time during training. This would have to improve overall "Intelligence" by a big amount.
Also, what a candidate claims as a skill is totally irrelevant and can be a lie. It is the work experience that matters, and skills can be extracted from it.
Text embeddings are not about matching, they are about extracting the semantic topics and the semantic context. Matching comes next, if required.
If an LLM is used to generate the text embeddings, it would «expand» the semantic context for each keyword. E.g. «GenAI» would make the LLM expand the term into directly and loosely related semantic topics, say, «LLM», «NLP» (with a lesser relevance though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings will result in a much richer semantic context that allows for straightforward similarity search as well as for exploratory radial search with ease. It also works well across languages, provided the LLM was trained on a linguistically and sufficiently diverse corpus.
Fun fact: I have recently delivered an LLM-assisted (to generate text embeddings) k-NN similarity search for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.
It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.
Cantonese and Vietnamese versions diverged and were less relevant, as the LLM did not have a substantial corpus in either language. This can easily be fixed in the future, once a new LLM version trained on a better corpus in both Cantonese and Vietnamese becomes available, simply by regenerating the text embeddings over the dataset. The implementation won't have to change.
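A toy reproduction of that cross-lingual effect with an off-the-shelf multilingual model (not the actual client setup; model and documents are placeholders):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    documents = [
        "An essay on the meaning of life and human purpose",
        "Quarterly financial report for 2023",
        "Instructions for assembling a bookshelf",
    ]
    doc_vecs = model.encode(documents, normalize_embeddings=True)

    # The same query in English, Korean, and Russian should land on the same document
    queries = ["the meaning of life", "삶의 의미", "смысл жизни"]
    for q in queries:
        q_vec = model.encode(q, normalize_embeddings=True)
        scores = util.cos_sim(q_vec, doc_vecs)[0]
        best = int(scores.argmax())
        print(f"{q!r} -> {documents[best]!r} ({float(scores[best]):.3f})")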
I added semantic search, but I'm working on adding resume upload/parsing to do automatic matching.
It just does cosine similarity with OpenAI embeddings + pgVector. It's not perfect by any means, but it's useful. It could probably stand to be improved with a re-ranker, but I just never got around to it.
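For reference, the core of that setup is roughly this (table and column names are made up; pgvector's <=> operator is cosine distance):

    import psycopg2
    from openai import OpenAI

    client = OpenAI()
    conn = psycopg2.connect("dbname=jobs")

    def embed(text: str) -> list[float]:
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return resp.data[0].embedding

    query_vec = embed("senior python backend engineer, remote")
    vec_str = "[" + ",".join(str(x) for x in query_vec) + "]"

    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator; 1 - distance gives similarity
        cur.execute(
            """
            SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
            FROM job_listings
            ORDER BY embedding <=> %s::vector
            LIMIT 10
            """,
            (vec_str, vec_str),
        )
        for row in cur.fetchall():
            print(row)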
I think the best "matching" factor is to minimize total distance where each distance is the time-multiplied vector for a specific skill.
Literally the next item on my roadmap for employbl dot com lol. We're calling it a "personalized job board" and using PGVector for storing the embeddings. I've also heard good things about Typesense though.
One thing I've found to be important when creating the embeddings is to not embed the whole job description. Instead, use an LLM to make a concise summary of the job listing (location, skills, etc.) in a structured format, then store the embedding of that summary. It reduces noise and increases accuracy for vector search.
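Something like this, sketched with the OpenAI client (the model names, prompt, and file path are placeholders, adjust to taste):

    from openai import OpenAI

    client = OpenAI()

    def summarize_listing(raw_listing: str) -> str:
        # Condense the listing into a structured summary before embedding it
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Summarize this job listing as 'Title / Location / Key skills / Seniority':\n\n" + raw_listing,
            }],
        )
        return resp.choices[0].message.content

    def embed(text: str) -> list[float]:
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return resp.data[0].embedding

    summary = summarize_listing(open("listing.txt").read())  # placeholder input file
    vector = embed(summary)  # store this, not the embedding of the full listing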
One quick minor note is that the resulting embeddings for the same text string could be different, depending on what you specify the input type as for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: https://docs.voyageai.com/reference/embeddings-api.
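For reference, with the Python client that looks something like this (a sketch based on the linked docs; double-check the exact parameters there):

    import voyageai

    vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

    # Same string, two different embeddings depending on input_type
    as_query = vo.embed(["install the package"], model="voyage-3", input_type="query")
    as_doc = vo.embed(["install the package"], model="voyage-3", input_type="document")

    print(as_query.embeddings[0][:5])
    print(as_doc.embeddings[0][:5])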
stella_en_1.5B_v5 seems to be an unsung hero model in that regard
plus you may not even want such large token sizes if you just need accurate retrieval of snippets of text (like 1-2 sentences)
The other day I found myself trying to figure out some common themes across a bunch of comments I was looking at. I felt too lazy to go through all of them, so I turned my attention to the "Sentence Transformers" lib. I converted each comment into a vector embedding, applied k-means clustering on these embeddings, then gave each cluster to ChatGPT to summarize the corresponding comments. I have to admit, it was fun doing this and saved me lots of time!
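In case anyone wants to reproduce that workflow, it's only a few lines (cluster count is arbitrary, and the final ChatGPT summarization step is left out):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    comments = ["comment one ...", "comment two ...", "comment three ..."]  # your data here

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(comments, normalize_embeddings=True)

    # Group comments into k clusters, then hand each cluster to an LLM to summarize
    k = min(5, len(comments))
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(embeddings)

    clusters = {i: [] for i in range(k)}
    for comment, label in zip(comments, labels):
        clusters[int(label)].append(comment)

    for i, members in clusters.items():
        print(f"--- cluster {i} ({len(members)} comments), paste these into ChatGPT ---")
        for m in members:
            print(m)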
This is what people using embedding models for recsys are doing. It’s not rocket science and it doesn’t require “infinite data”.
By 100k samples I mean 100k samples that provide relevance feedback. 100k positive pairs.
I’m working on these kinds of problems with actual customers. Not really sure where the hostility is coming from.
1) that matryoshka representations work so well, and as few as 64 dimensions account for a large majority of the performance
2) that dimensional collapse is observed. Look at your cosine similarity scores and be amazed that everything is pretty similar: despite the scale running from -1 to 1, almost nothing is ever less than 0.8 for most models
I think we're in the infancy of this technology, even with all of the advances in recent years.
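Both effects above are easy to check yourself. A rough sketch (the model here is just one example of a matryoshka-trained embedding model, so truncation is meaningful, and the texts are toy data):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Assumes a matryoshka-trained model so that truncating dimensions makes sense
    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    texts = [
        "How do I reset my password?",
        "Steps to recover account access",
        "Best pizza in Naples",
    ]
    full = model.encode(texts, normalize_embeddings=True)

    # 1) Truncate to the first 64 dimensions and renormalize
    small = full[:, :64]
    small = small / np.linalg.norm(small, axis=1, keepdims=True)

    def cos_matrix(x):
        return x @ x.T

    print("full-dimension cosine similarities:\n", np.round(cos_matrix(full), 3))
    print("64-dimension cosine similarities:\n", np.round(cos_matrix(small), 3))
    # 2) Note how compressed the score range is, even for unrelated sentences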
I haven't found a tool that is more effective in helping me learn.
Chess has become even more popular despite computers that can “rob us” of the joy. They’re even better practice partners.
Most car owners would never say outright "I want a car-centric culture". But car manufacturers lobbied for it, and step by step, we got both the deployment of useful car infrastructure, and the destruction or ignoring of all amenities useful for people walking or cycling.
Now let's go back to the period where cars start to become enormously popular, and cities start to build neighborhoods without sidewalks. There was probably someone at the time complaining about the risk of cars overtaking walking and leading to stores being more far away etc. And in front of them was probably someone like you calling them a luddite and being oblivious of second order effects.
Your software development methodology is your own. Why does someone else’s use of a tool deprive you of doing things the way you want?
I too think embeddings are vastly underutilized, and chat interface is not the be-all, end-all (not to mention, "chat with your program/PDF/documentation" just sounds plain stupid). However, whether current AI tools are replacing or amplifying your intelligence, is entirely down to how you use them.
As for search, yes, that was a huge breakthrough and powerful amplifier. 2+ decades ago. At this point it's computer use 101 - which makes it sad when dealing with programs or websites that are opaque to search, and "ubiquitous local search" is still not here. Embeddings can and hopefully will give us better fuzzy/semantic searching, but if you push this far enough, you'll have to stop and ask - if the search tool is now capable to understand some aspects of my data, why not surface this understanding as a different view into data, instead of just invoking it in the background when user makes a search query?
There were issues an embedding model might not do well on, whereas an LLM could handle them. For example, there were camelCase words like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data). WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated the embedding, but these were supposed to go into two different categories.
When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
I can provide what I can provide publicly. The first thing we ever do is develop benchmarks given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient especially in terms of RAM and storage. Especially important when factoring in budgetary requirements. We're already dropping in eight H100s (minimum) so it starts to creep up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
I should also add that some aspects of this (like pretraining BERT) are fairly compute-intensive. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).
[0] - https://huggingface.co/datasets/atomic-canyon/FermiBench
[1] - https://en.wikipedia.org/wiki/Frontier_(supercomputer)
I’m admittedly unfamiliar with the space, but having just done some reading that doesn’t look to be true. Can you elaborate please and maybe point to some external support for such a bold claim?
SOTA LLMs?
If you think about what, say, a chair or electricity or love are, or what it means for something to be something, etc., I believe you'll quickly realize that words and concepts don't have well-defined meanings. Rather, we define things in terms of other things, which themselves are defined in terms of other things, and so on. There's no atomic meaning, the meaning is in the relationships between the thought and other thoughts.
And that is exactly what those models capture. They're trained by consuming a large amount of text - but not random text, real text - and they end up positioning tokens as points in high-dimensional space. As you increase the number of dimensions, there's eventually enough of them that any relationship between any two tokens (or groups; grouping concepts out of tokens is just another relationship) can be encoded in the latent space as proximity along some vector.
You end up with a real computational artifact that implements the idea of defining concepts only in terms of other concepts. Now, between LLMs and the ability to identify and apply arbitrary concepts with vector math, I believe that's as close to the idea of "understanding" as we've ever come.
Where I don’t follow is how a bitmap approximation captures that in a semiotic way. As far as I can tell the semiosis still is all occurring in the human observer of the machine’s output. The mathematics still hasn’t captured the interpretant so far as I can see.
Regardless of my possible incomprehension, I appreciate your elucidation. Thanks!
LLMs help me write code faster and understand new libraries, image generation helps me build sites and emails faster, etc
Training your own fine-tune takes very little time and GPU resources, and you can easily outperform even SOTA models on your specific problem with a smaller model/vector space.
Then again, on general English text with basic fuzzy search, I would not really expect big performance gains.
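For what it's worth, a fine-tune along those lines can be sketched in a few lines with sentence-transformers (the pairs and hyperparameters are placeholders; real gains depend on having a decent set of domain-specific positive pairs):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Domain-specific (query, relevant passage) positive pairs
    train_examples = [
        InputExample(texts=["refund request for damaged item", "Our returns policy covers damaged goods ..."]),
        InputExample(texts=["reset two-factor authentication", "To reset 2FA, go to account security settings ..."]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # In-batch negatives: every other passage in the batch is treated as a negative
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
    model.save("my-domain-embedding-model")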
This is a good time to shill for my visualization of 5 million embeddings of HN posts, users and comments: https://tomthe.github.io/hackmap/
scikit-learn also has options: https://scikit-learn.org/stable/auto_examples/manifold/plot_...
I found clustering by topic was hard, because tone dimensions (whatever they were) seemed to dominate.
How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?
In the end I found it easier to just ask an LLM to group articles by topic.
Potentially solves your issue, but it is also handy when you have to chunk a larger document and would lose context from calculating the embedding just on the chunk.
Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations and I am only interested in the content of the discussion, not the tone of it.
How could I derive an embedding that represents only content and not tone?
To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but same content to compute a "style direction". Then when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style:
query_without_style = query - dot(query, style_direction) * style_direction
I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.

Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.
And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.
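Concretely, the projection step above looks like this (a minimal sketch with a single style pair; in practice you would average the difference over many pairs, as discussed):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Same content, different styles, used only to estimate a "style direction"
    formal = model.encode("We regret to inform you that your request has been declined.", normalize_embeddings=True)
    casual = model.encode("sorry, can't do that one", normalize_embeddings=True)

    style_direction = formal - casual
    style_direction /= np.linalg.norm(style_direction)

    def remove_style(vec: np.ndarray) -> np.ndarray:
        # query_without_style = query - dot(query, style_direction) * style_direction
        out = vec - np.dot(vec, style_direction) * style_direction
        return out / np.linalg.norm(out)

    query = model.encode("customer asking why their refund was declined", normalize_embeddings=True)
    query_without_style = remove_style(query)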
They are still a “direction” in the way that [0.5, 0.5] in x,y space is a 45 degree angle, and in that direction it has a magnitude of around 0.7
So of course you could probably define some other vector space where many of the different labeled vectors are translated to magnitudes in the original embedding space, letting you do things like have a “tone” slider.
I'd say this assumption is both right and wrong. Wrong, because it's unlikely there's a direction in embedding space corresponding to a platonic ideal of "tone". Right, because I suspect that, for sufficiently large embedding space (on the order of what goes into current LLMs), any continuous concept we can articulate will have a corresponding direction in the embedding space, that's roughly as sharp as our ability to precisely define the concept.
With 281M params it's also relatively small (at least for an embedding model) so one can play with it relatively easily.
We have customers doing this in production in other contexts.
If you have fundamentally different access patterns (e.g. doc -> doc retrieval instead of query -> doc retrieval) then it's often time to just maintain another embedding index with a different model.
In your case, it would be to take a bunch of texts that roughly mean the same thing but vary in tone, compute the PCA of the normalized embeddings, take the top axis (or top few) and project it out (i.e. subtract the projection) of the embeddings for the documents you care about before doing the cosine similarity.
Something along those lines.
Could be it's a terrible idea, haven't had time to do much with it yet due to work.
[1]: https://en.wikipedia.org/wiki/Principal_component_analysis
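A sketch of that PCA variant (the paraphrase set and component count are arbitrary; treat it as an experiment, not a recommendation):

    import numpy as np
    from sklearn.decomposition import PCA
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Texts that mean roughly the same thing but vary in tone
    paraphrases = [
        "Please restart the router and try again.",
        "Just reboot the router, mate, and give it another go.",
        "Kindly power-cycle your router before retrying.",
        "turn it off and on again lol",
    ]
    P = model.encode(paraphrases, normalize_embeddings=True)

    # The top principal component of this set is (hopefully) the tone/style axis
    pca = PCA(n_components=1).fit(P)
    tone_axis = pca.components_[0]
    tone_axis /= np.linalg.norm(tone_axis)

    def project_out(vecs: np.ndarray) -> np.ndarray:
        # Subtract each vector's projection onto the tone axis, then renormalize
        out = vecs - np.outer(vecs @ tone_axis, tone_axis)
        return out / np.linalg.norm(out, axis=1, keepdims=True)

    docs = model.encode(["The router keeps dropping the connection."], normalize_embeddings=True)
    docs_no_tone = project_out(docs)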
I've found all of the RAG applications I've tried to be pretty underwhelming, but semantic search itself (especially combined with full-text search) is very cool.
[0] https://nonint.com/2023/10/18/is-the-reversal-curse-a-genera...
In principle, I agree with much of the sentiment here. Embeddings can get you pretty far. If the goal is to find information and citations/links, you can accomplish most of that with a simple embeddings/vector search.
GenAI does have an upside in that it can distill and process those results into something more refined. One of the main production use cases is retrieval augmented generation (RAG). The "R" is usually a vector search but doesn't have to be.
As we see with things like ChatGPT search and Perplexity, there is a push towards using LLMs to summarize the results but also linking to the results to increase user confidence. Even Google Search now has that GenAI section at the top. In general, users just aren't going to accept LLM responses without source citations at this point. The question is whether the summary provides value or whether the citations really provide the most value. If it's the latter, then embeddings will get the job done.
It is a nice read though - explaining the basics of vector spaces, similarity and how it is used in modern ML applications.
https://news.ycombinator.com/item?id=42014036
> I didn't see the strong argument highlighting what powerful feature exactly people were missing in relation to embeddings
I had to leave out specific applications as "an exercise for the reader" for various reasons. Long story short, embeddings provide a path to make progress on some of the fundamental problems of technical writing.
> I had to leave out specific applications as "an exercise for the reader"

This is very unfortunate. Would be very interesting to hear some intel :)
I'm shocked at the number of startups, etc you see trying to do RAG, etc that basically have no idea what they are, how they actually work, etc.
The "R" in RAG stands for retrieval - as in the entire field of information retrieval. But let's ignore that and skip right to the "G" (generative)...
Garbage in, garbage out people!
I made an episode to appreciate the book: https://podcasters.spotify.com/pod/show/podgenai/episodes/Th...
0. Generate an embedding of some text, so that you have a known good embedding, this will be your target.
1. Generate an array of random tokens the length of the response you want.
2. Compute the embedding of this response.
3. Pick a random sub-section of the response and randomize the tokens in it again.
4. Compute the embedding of your new response.
5. If the embeddings are closer together, keep your random changes, otherwise discard them, go back to step 2.
6. Repeat this process until going back to step 2 stops improving your score. Also you'll probably want to shrink the size of the sub-section you're randomizing the closer your computed embedding is to your target embedding. Also you might be able to be cleverer by doing some kind of masking strategy? Like let's say the first half of your response text already was actually the true text of the target embedding. An ideal randomizer would see that randomizing the first half almost always makes the result worse, and so would target the 2nd half more often (I'm hoping that embeddings work like this?).
7. Do this N times and use an LLM to score and discard the worst N-1 results. I expect that 99.9% of the time you're basically producing adversarial examples w/ this strategy.
8. Feed this last result into an LLM and ask it to clean it up.
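A crude sketch of steps 0–6 (greedy random search only; the LLM scoring and cleanup steps are left out, and the "tokens" here are just words drawn from a tiny placeholder vocabulary):

    import random
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Step 0: a known-good target embedding
    target_text = "The quick brown fox jumps over the lazy dog"
    target = model.encode(target_text, normalize_embeddings=True)

    # Step 1: a random starting response
    vocab = "the a quick slow brown red fox dog cat jumps runs over under lazy happy tree river".split()
    length = 9
    guess = [random.choice(vocab) for _ in range(length)]

    def score(words):
        # Steps 2 and 4: embed the candidate and compare to the target
        vec = model.encode(" ".join(words), normalize_embeddings=True)
        return float(np.dot(vec, target))

    best = score(guess)
    for step in range(500):
        # Step 3: randomize a random sub-section; steps 5-6: keep it only if similarity improves
        i = random.randrange(length)
        j = min(length, i + random.randint(1, 3))
        candidate = guess[:i] + [random.choice(vocab) for _ in range(j - i)] + guess[j:]
        s = score(candidate)
        if s > best:
            guess, best = candidate, s

    print(best, " ".join(guess))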
https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...
That is mostly just that we don’t want folks going up and doing a 30 minute demo of Sphinx or something :-)
> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways as text generation models, with all the energy usage that implies. I’ll update this section when I find out more.
Although I do care about the environment, this question is completely the wrong one if you ask me. There is in public opinion (mainstream media?) some kind of idea that we should use less AI and that this would somehow solve our climate problems.
As a counterexample, let's go to the extreme. Let's ban Google Maps because it takes computational resources from the phone. As a result more people will take wrong routes, and thus use more petrol. Say you use one gallon of petrol extra; that wastes about 34 kWh, the equivalent of running 34 powerful vacuum cleaners on full power for an hour. In contrast, say you downloaded your map; then the total "cost" is only the power used by the phone. A mobile phone has a battery of about 4 Ah (4,000 mAh), so 4 Ah * 4.2 V ≈ 17 Wh, or 0.017 kWh. This means the phone is roughly 2,000 times as efficient, even if routing drained the entire battery! And then we didn't even consider the time saved for the human.
It's the same with running embeddings for doc generation. An Nvidia H100 consumes about 700 W, so say 1 kWh after an hour of running. 1 kWh should be enough to do a bunch of embedding runs. If this then saves, for example, one workday including the driving back and forth to the office, then again the tradeoff is highly in favor of the compute.
I lived somewhere that would get to 40°C in the summers, and an oscillating fan was good enough to keep cool; the AC was nice to have but it wasn't necessary.
I find it very hypocritical when people tell you to change your lifestyle for climate change but have the A/C blasting all day long, every day.
If it's not too humid.
I've lived in places with relatively high humidity and 35-40C, and have had the misfortune of not having an AC. Fans are not enough. I mean, sure, you can survive, but it really, really sucks.
This is entirely meaningless without providing the humidity. At higher than 70% relative humidity 40C is potentially fatal.
As others have said, that works if you lived in a very dry area. And perhaps it was a house or a building optimized for airflow. And you didn't have much to do during the day. And I'm guessing you're young, and healthy.
Here in central Europe, sustained 40°C in the summer would rack up a significant body count. Anything above 30°C sucks really bad if you have any work to do. Or if, say, you're pregnant or have small children. And you live in a city.
Residential A/C isn't a luxury anymore, it's becoming a necessity very fast. Fortunately, heat pumps are one of the most efficient inventions of humankind. In particular, "A/C blasting all day long" beats anything else you could do to mitigate the heat if it involved getting into a car. And then it also beats whatever else you're doing to heat your place during the winter.
What remains, though, is that increased productivity has rarely led to a decrease in energy usage. Whether energy scarcity will drive model optimisation is anyone's guess, but it would be a differentiating feature in a market saturated with similarly capable offerings.
They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They have only a very limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings can get even fuzzier).
Overly high expectations lead people to believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.
Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.
There are embedding approaches that balance "semantic understanding" with BM25-ish.
They're still pretty obscure outside of the information retrieval space but sparse embeddings[0] are the "most" widely used.
I think sparse retrieval with cross-encoders doing reranking is still significantly better than embeddings. Embedding indexes are also difficult to scale, since HNSW consumes too much memory above a few million vectors and IVF-PQ has issues with recall.
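A bare-bones version of that pipeline (BM25 candidates re-scored by a cross-encoder; the corpus and model choices are just examples):

    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder

    corpus = [
        "How to replace the pump seal on model X",
        "Annual maintenance schedule for cooling systems",
        "Troubleshooting guide for pressure sensor faults",
    ]
    query = "pump seal replacement procedure"

    # Stage 1: sparse retrieval with BM25
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    candidates = bm25.get_top_n(query.lower().split(), corpus, n=3)

    # Stage 2: rerank the candidates with a cross-encoder
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in candidates])

    for doc, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
        print(f"{s:.2f}  {doc}")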
> As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?
Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well-understood convention for such identification, like say model_name:model_version:model_hash or something? For technical docs (obviously a very broad field), is there an embedding model (or a small number of them) that is widely used or obviously highly suitable, such that a site owner could choose one and have some reasonable expectation that publishing embeddings for their docs generated with that model would be useful to others? (Naive questions, I am not embedded in the field.)
Edit: This value has now been fixed in the article.
[1] https://platform.openai.com/docs/models/embeddings#embedding...
[2] https://platform.openai.com/docs/guides/embeddings/#embeddin...
Also, what each model means by a token can be very different due to the use of different model-specific encodings, so ultimately one must compare the number of characters, not tokens.
* https://news.ycombinator.com/item?id=42014683
* https://news.ycombinator.com/item?id=42015282
Updating that section now
Even wrote about it at: https://blog.dobror.com/2024/08/30/how-embeddings-make-your-...
I am not denying their usefulness, but it's misleading.