Embeddings are underrated
363 points by misonic 6 days ago | 175 comments
  • kaycebasques 6 days ago |
    Cool, first time I've seen one of my posts trend without me submitting it myself. Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.
    • dartos 6 days ago |
      Yeah embeddings are the unsung killer feature of LLMs
    • donavanm 6 days ago |
      You might want to highlight chunking and how embeddings can/should represent subsections of your document as well. It seems relevant to me for cases like similarity or semantic search, getting the reader to the relevant portion of the document or page.

      There are probably some interesting ideas around tokenization and metadata as well. For example, if you're processing the raw file I expect you want to strip out a lot of markup before tokenization of the content. Conversely, some markup like code blocks or examples would be meaningful for tokenization and embedding anyway.

      I wonder if both of those ideas can be combined for something like automated footnotes and annotations. Linking or mouseover relevant content from elsewhere in the documentation.

      • MrGreenTea 6 days ago |
        Do you have any resources you recommend for representing subsections? I'm currently prototyping a note/thoughts editor where one feature is suggesting related documents/thoughts (think linked notes in Obsidian), for which I would like to suggest subsections and not only full documents.
        • donavanm 6 days ago |
          Sorry, no good references off hand. I've had to help write & generate public docs in DocBook in the past, but I'm no expert on editors, NLP, or embeddings besides hacking around some tools for my own note taking. My assumption is you'll want to use your existing markup structure, if you have it. Or naively split on paragraphs with a tool like spaCy. Or get real fancy and use dynamic ranges: something like an accumulation window that aggregates adjacent sentences based on individual similarity, breaks on total size or dissimilarity, and then treats that aggregate as the range to "chunk."
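
          A rough sketch of that accumulation-window idea, in case it helps (the model name and thresholds are arbitrary placeholders, not recommendations):

              from sentence_transformers import SentenceTransformer, util

              model = SentenceTransformer("all-MiniLM-L6-v2")

              def chunk_by_similarity(sentences, sim_threshold=0.5, max_sentences=8):
                  embeddings = model.encode(sentences, normalize_embeddings=True)
                  chunks, current = [], [0]
                  for i in range(1, len(sentences)):
                      # break on dissimilarity to the previous sentence, or on size
                      sim = float(util.cos_sim(embeddings[i - 1], embeddings[i]))
                      if sim < sim_threshold or len(current) >= max_sentences:
                          chunks.append(" ".join(sentences[j] for j in current))
                          current = []
                      current.append(i)
                  chunks.append(" ".join(sentences[j] for j in current))
                  return chunks
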
          • MrGreenTea 5 days ago |
            Thanks for the elaborate and helpful response. I'm also hacking on this as a personal note taking project and already started playing around with your ideas. Thanks!
    • enjeyw 6 days ago |
      Haha yeah I was about to comment that I recall a period just after Word2Vec came out where embeddings were most definitely not underrated but rather the most hyped ML thing out there!
  • rahimnathwani 6 days ago |
    I'm not sure why the voyage-3 models aren't on the MTEB leaderboard. The code for the leaderboard suggests they should be there: https://huggingface.co/spaces/mteb/leaderboard/commit/b7faae...

    But I don't see them when I filter the list for 'voyage'.

    • newrotik 6 days ago |
      It is unclear whether this model should be on that leaderboard, because we don't know whether it has been trained on MTEB test data.

      It is worth noting that their own published material [0] does not include any score from any dataset in the MTEB benchmark.

      This may sound nitpicky, but considering transformers' parroting capabilities, having seen test data during training should be expected to completely invalidate those scores.

      [0] see excel spreadsheet linked here https://blog.voyageai.com/2024/09/18/voyage-3/

      • jdthedisciple 6 days ago |
        I'm critical of the low number of embedding dims.

        Could hurt performance in niche applications, in my estimation.

        Looking forward to trying the announced large models though.

    • fzliu 6 days ago |
      (I work at Voyage)

      Many of the top-performing models that you see on the MTEB retrieval for English and Chinese tend to overfit to the benchmark nowadays. voyage-3 and voyage-3-lite are also pretty small in size compared to a lot of the 7B models that take the top spots, and we don't want to hurt performance on other real-world tasks just to do well on MTEB.

      • jdthedisciple 6 days ago |
        It would still be great to know how it compares?

        Why should I pick voyage-3 if for all I know it sucks when it comes to retrieval accuracy (personally my most important metric)?

        • fzliu 6 days ago |
          We provide retrieval metrics for a variety of datasets and languages: https://blog.voyageai.com/2024/09/18/voyage-3/. I also personally encourage folks to either test on their own data or to find an open source dataset that closely resembles the documents they are trying to search (we provide a ton of free tokens for evaluating our models).
      • kkielhofner 6 days ago |
        > we don't want to hurt performance on other real-world tasks just to do well on MTEB

        Nice!

        Fortunately MTEB lets you sort by model parameter size because using 7B parameter LLMs for embeddings is just... Yuck.

  • quantadev 6 days ago |
    That was a good post. Vector embeddings are in some sense a unique summary of a doc, similar to a hash code. It makes me think it would be cool if there were some universal standard for generating embeddings, but I guess they'll be different for each AI model, so they can't have the same kind of "permanence" hash codes have.

    It definitely also seems like there should be lots of ways to utilize "Cosine Similarity" (or other closeness algos) in databases and other information processing apps that we haven't really exploited yet. For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.
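
    The core of such a matcher would be tiny, something like this, where `embed` is just a stand-in for whatever embedding model or API you use:

        import numpy as np

        def cosine_similarity(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        # `embed` stands in for any embedding model/API call returning a vector
        job_vec = embed(job_description)
        ranked = sorted(resumes, key=lambda r: cosine_similarity(embed(r), job_vec), reverse=True)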

    • kqr 6 days ago |
      For one point of inspiration, see https://entropicthoughts.com/determining-tag-quality

      I really like the picture you are drawing with "semantic hashes"!

      • quantadev 6 days ago |
        Yeah for "Semantic Hashes" (that's a good word for them!) we'd need some sort of "Canonical LLM" model that isn't necessarily used for inference, nor does it need to even be all that smart, but it just needs to be public for the world. It would need to be updated like every 2 to 5 years tho to account for new words or words changing meaning? ...but maybe could be updated in such a way as to not "invalidate" prior vectors, if that makes sense? For example "ride a bicycle" would still point in the same direction even after a refresh of the canonical model? It seems like feeding the same training set could replicate the same model values, but there are nonlinear instabilities which could make it disintegrate.
        • kqr 5 days ago |
          Maybe the embedding could be paired up with a set of words that embed to somewhere close to the original embedding? Then the embedding can be updated for new models by re-embedding those words. (And it would be more interpretable by a human.)
          • quantadev 5 days ago |
            I mean it was just a thought I had. May be a "solution in search of a problem". I generate those a lot! haha. But it seems to me like having some sort of canonical set of training data and a canonical LLM architecture, we'd end up able to generate consistent embeddings of course, but I'm just not sure what the use cases are.
    • helloplanets 6 days ago |
      I guess it might be possible to retroactively create an embeddings model which could take several different models' embeddings, and translate them into the same format.
      • genuinelydang 6 days ago |
        No. That’s like saying you can transplant a person’s neuronal action potentials into another person’s brain and have it make sense to them.
        • helloplanets 6 days ago |
          That metaphor is skipping the most important part in between! You wouldn't be transplanting anything directly, you'd have a separate step in between, which would attempt to translate these action potentials.

          The point of the translating model in between would be that it would re-weight each and every one of the values of the embedding, after being trained on a massive dataset of original text -> vector embedding for model A + vector embedding for model B. If you have billions of parameters trained to do this translation between just two specific models to start with, wouldn't this be in the realm of the possible?

          • quantadev 6 days ago |
            A translation between models doesn't seem possible because there are actually no "common dimensions" at all between models. That is, each dimension has a completely different semantic meaning in different models, and moreover it's the combination of dimension values that begins to impart real "meaning".

            For example, the number of different unit vector combinations in a 1500-dimensional space is like the number of different ways of "ordering" the components, which is 1500! ≈ 5 × 10^4114.

            EDIT: And the point of that factorial is that even if the dimensions were "identical" across two different LLMs but merely "scrambled" (in ordering), there would be that large number to contend with to "unscramble".

        • tempusalaria 6 days ago |
          This is very similar to how LLMs are taught to understand images in llava style models (the image embeddings are encoded into the existing language token stream)
      • batch12 6 days ago |
        This is definitely possible. I made something like this. It worked pretty well for cosine similarity in my testing.
      • nostrebored 6 days ago |
        This is done with two models in most standard biencoder approaches. This is how multimodal embedding search works. We want to train a model such that the text embeddings that represent an item and the image embeddings for that item are colocated.
    • genuinelydang 6 days ago |
      ”you could almost build a new kind of Job Search Service that matches job descriptions to job candidates”

      The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

      For example, say a job requires A and B.

      Candidate 1 is a junior who has done some work with A, B and C.

      Candidate 2 is a senior and knows A, B, C, D, E and F by heart. All are relevant to the job and would make 2 the optimal candidate, even though C–F are not explicitly stated in the job requirements.

      Candidate 1 would seem a much better candidate than 2, because 1’s embedding vector is closer to the job embedding vector.

      • coldtea 6 days ago |
        Even that is just static information.

        We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.

        So Candidate 1 could still blow them out of the water in performance, and even be able to trivially learn D and E in a short while on the job if needed.

        The skill vector won't tell much by itself, and may even prevent finding the better candidate if it's used for screening.

        • quantadev 6 days ago |
          So your point is that LLMs can't tell when job candidates are lying on their resume? Well that's true, but neither can humans. lol.
        • inkyoto 5 days ago |
          > We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.

          That is indeed a problem. I have been thinking about a possible solution to the very same problem for a while.

          The fact is: people lie on their resumes, and they do it for different reasons. There are white lies (e.g. pumping something up because they aspire to it but were not presented with an opportunity to do it, yet they are eager to skill themselves up, learn and do it if given the chance). Then there are other lies. Generally speaking, lies are never black or white, true or false; they are a shade of grey.

          So the best idea I have been able to come up with so far is a hybrid solution that combines text embeddings (the skills similarity match and search) with sentiment analysis (to score the sincerity of the information stated on a resume) to gain extra insight into the candidate's intentions. Granted, sentiment analysis is an ethically murky area…

          • quantadev 4 days ago |
            Sincerity score on a resume? I can't tell if you're joking or not. I mean yeah, any sentence that ends in something like "...yeah, that's the ticket." would be detectable for sure, but I'm not sure everyone is as bad a liar as Jon Lovitz.
            • inkyoto 4 days ago |
              Are you speaking hypothetically or from your own experience? Sentiment analysis is a thing, and it mostly works – I have tested it with satisfactory results on sample datasets. It is relatively easy to extract the emotional context from a corpus of text, less so when it comes to resumes due to their inherently more condensed content. Which is precisely why I mentioned ethical considerations in my previous response. With extra effort and fine-tuning, it should be possible to overcome most of the false negatives though.
              • quantadev 4 days ago |
                Sure, AI can detect emotional tones (being positive, being negative, even sarcasm sometimes) in writing, so if you mean something like detecting negativity in a resume so it can be thrown immediately in the trash, then I agree that can work. Any negative emotionality is always a red flag.

                But insofar as detecting lies in sentences, that simply cannot be done, because even if it ever did work the failure rate would still be 99%, so you're better off flipping a coin.

      • nostrebored 6 days ago |
        That's not accurate. You can explicitly bake in these types of search behaviors with model training.

        People do this in ecommerce with the concept of user embeddings and product embeddings, where the result of personalized recommendations is just a user embedding search.

      • quantadev 6 days ago |
        > not useful for the task of finding an optimal candidate

        That statement is just flat-out incorrect on its face. However, it did make me think of something I hadn't thought of before, which is this:

        Embedding vectors can be given a "scale" (multiplier) on specific terms which represents the amount of "weight" to add to that term. For example, if I have 10 years of experience in Java Web Development, then we can take the actual components of that vector embedding (i.e. for the string "Java Web Development") and multiply them by some proportion of 10, which results in a vector that is "further" in that direction. This represents an "amount" of experience in the Java Web direction.

        So this means even with vector embeddings we can scale by specific amounts of experience. Now here's the cool part: you can then take all THOSE scaled vectors (one for each individual job candidate skill) and average them to get a single point in space, which CAN be compared as a single scalar distance from what the job requirements specify.
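
        In code the whole scheme might look something like this (a toy sketch; the per-skill embeddings and `job_vector` are assumed given):

            import numpy as np

            # skills: list of (embedding, years) pairs for one candidate
            def candidate_vector(skills):
                # scale each unit-length skill vector by years of experience, then average
                scaled = [years * (vec / np.linalg.norm(vec)) for vec, years in skills]
                return np.mean(scaled, axis=0)

            # smaller distance to the job's requirement vector = better match
            score = np.linalg.norm(candidate_vector(candidate_skills) - job_vector)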

        • genuinelydang 5 days ago |
          Then you would have to renormalize the vectors. You really, really want to keep unit-length vectors, because that is the special case where cosine similarity equals the dot product (and ranks the same as Euclidean distance).
          • quantadev 5 days ago |
            I meant the normalized hyperspace direction (unit vector) represents a particular "skill" and the distance into that direction (extending outside the unit hypersphere) is years of experience.

            This is geometrically "meaningful", semantically. It would apply to not just a time vector (experience) but in other contexts it could mean other things. Like for example, money invested into a particular sector (Hedge fund apps).

            This makes me realize we could design a new type of Perceptron (MLP) where specific scalars for particular things (money, time, etc.) could be wired into the actual NN architecture, in such a way that a specific input "neuron" would be fed a scalar for time, and a different neuron a scalar for money, etc. You'd have to "prefilter" each training input to generate the individual scalars, but then input them into the same "neuron" every time during training. This would have to improve overall "Intelligence" by a big amount.

      • OutOfHere 6 days ago |
        The trick is to evaluate a score for each skill, weighting it by the years of experience with that skill, then sum the evaluations. This will address your problem 100%.

        Also, what a candidate claims as a skill is totally irrelevant and can be a lie. It is the work experience that matters, and skills can be extracted from it.

      • inkyoto 5 days ago |
        > The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

        Text embeddings are not about matching, they are about extracting the semantic topics and the semantic context. Matching comes next, if required.

        If an LLM is used to generate the text embeddings, it would «expand» the semantic context for each keyword. E.g. «GenAI» would make the LLM expand the term into directly and loosely related semantic topics, say, «LLM», «NLP» (with a lesser relevance though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings will result in a much richer semantic context that will allow for straightforward similarity search as well as for exploratory radial search with ease. It also works well across languages, provided the LLM was trained on a linguistically and sufficiently diverse corpus.

        Fun fact: I have recently delivered an LLM-assisted k-NN similarity search (using the LLM to generate the text embeddings) for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.

        It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.

        Cantonese and Vietnamese versions diverged and were less relevant, as the LLM did not have a substantial corpus in either language. This can easily be fixed in the future, once a new LLM version trained on a better corpus in both Cantonese and Vietnamese is available – by regenerating the text embeddings on the dataset. The implementation won't have to change.

    • rasulkireev 6 days ago |
      I tried doing something like that: https://gettjalerts.com/

      I added semantic search, but I'm working on adding resume upload/parsing to do automatic matching.

    • SCUSKU 6 days ago |
      It does exist! I built this for the monthly Who's Hiring threads: https://hnresumetojobs.com/

      It just does cosine similarity with OpenAI embeddings + pgVector. It's not perfect by any means, but it's useful. It could probably stand to be improved with a re-ranker, but I just never got around to it.
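
      Minus the pgvector part, the core ranking logic is just a few lines (a sketch; it assumes the OpenAI Python client and that `resume_text` and `jobs` exist):

          import numpy as np
          from openai import OpenAI

          client = OpenAI()

          def embed(text):
              resp = client.embeddings.create(model="text-embedding-3-small", input=text)
              return np.array(resp.data[0].embedding)

          resume_vec = embed(resume_text)
          # OpenAI embeddings come back unit-length, so dot product = cosine similarity
          ranked = sorted(jobs, key=lambda j: float(np.dot(embed(j), resume_vec)), reverse=True)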

      • quantadev 6 days ago |
        Very cool. I knew it was too obvious an idea to be missed! Did you read my comments below about how you can maybe "scale up" a vector based on the number of years of experience? I think that will work. It makes somebody with 10 yrs Java experience closer to the target than someone with only 5 yrs, if the target is 10 years -- but the problem is someone with 20 yrs looks even worse, when they should look better! My problem in my life. hahaha. Too much experience.

        I think the best "matching" factor is to minimize total distance where each distance is the time-multiplied vector for a specific skill.

    • connor11528 3 days ago |
      > For example you could almost build a new kind of Job Search Service that matches job descriptions to job candidates based on nothing but a vector similarity between resume and job description. That's probably so obvious it's being done, already.

      Literally the next item on my roadmap for employbl dot com lol. we're calling it a "personalized job board" and using PGVector for storing the embeddings. I've also heard good things about Typesense though.

      One thing I've found to be important when creating the embeddings is to not embed the whole job description. Instead, use an LLM to make a concise summary of the job listing (location, skills, etc.) in a structured format, then store the embedding of that summary. It reduces noise and increases accuracy for vector search.
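
      Roughly like this (a sketch; the prompt and model name are illustrative, and `embed`/`listing` are stand-ins for whatever you use):

          from openai import OpenAI

          client = OpenAI()

          def summarize_listing(job_description):
              # condense the listing to structured facts before embedding, to cut noise
              resp = client.chat.completions.create(
                  model="gpt-4o-mini",  # illustrative model choice
                  messages=[{"role": "user", "content":
                      "Summarize this job listing as location, skills, seniority "
                      "(one line each):\n" + job_description}],
              )
              return resp.choices[0].message.content

          vector = embed(summarize_listing(listing))  # store this, not the raw listing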

  • Aeolun 6 days ago |
    Is there some way to compare different embeddings for different use cases?
    • jdthedisciple 6 days ago |
      Search for MTEB Leaderboard on huggingface
  • fzliu 6 days ago |
    Great post!

    One quick minor note is that the resulting embeddings for the same text string could be different, depending on what you specify the input type as for retrieval tasks (i.e. query or document) -- check out the `input_type` parameter here: https://docs.voyageai.com/reference/embeddings-api.
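
    Roughly, per those docs (a sketch; check the linked reference for exact client usage):

        import voyageai

        vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
        # same string, two different vectors depending on the retrieval role
        q = vo.embed(["reset a password"], model="voyage-3", input_type="query")
        d = vo.embed(["reset a password"], model="voyage-3", input_type="document")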

  • thund 6 days ago |
    Doesn't OpenAI's embedding model support 8191/8192 tokens? That aside, declaring a winner by token limit is misleading. There are more important factors, like cross-language support and precision, for example.
    • jdthedisciple 6 days ago |
      Yep, voyage-3 is not anywhere near the top of the MTEB leaderboard if you order by `retrieval score` desc.

      stella_en_1.5B_v5 seems to be an unsung hero model in that regard

      plus you may not even want such large token sizes if you just need accurate retrieval of snippets of text (like 1-2 sentences)

      • kaycebasques 6 days ago |
        Thanks thund and jdthedisciple for these points and corrections. I'll update the section today.
        • kaycebasques 6 days ago |
          Updated the section to refer to the "Retrieval Average" column of the MTEB leaderboard. Is that the right column to refer to? Can someone link me to an explanation of how that benchmark works? Couldn't find a good link on it
    • OutOfHere 6 days ago |
      And that's not all because token encodings of different models can be very different.
  • nerdright 6 days ago |
    Great post indeed! I totally agree that embeddings are underrated. I feel like the "information retrieval/discovery" world is stuck using spears (i.e., term/keyword-based discovery) instead of embracing the modern tools (i.e., semantic-based discovery).

    The other day I found myself trying to figure out some common themes across a bunch of comments I was looking at. I felt too lazy to go through all of them, so I turned my attention to the "Sentence Transformers" lib. I converted each comment into a vector embedding, applied k-means clustering on these embeddings, then gave each cluster to ChatGPT to summarize the corresponding comments. I have to admit, it was fun doing this and saved me lots of time!
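
    The whole pipeline fits in a few lines (a sketch; the model name and number of clusters are arbitrary):

        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import KMeans

        model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = model.encode(comments, normalize_embeddings=True)

        # group the comments into k themes, then hand each group to an LLM to summarize
        labels = KMeans(n_clusters=5).fit_predict(embeddings)
        clusters = {k: [c for c, l in zip(comments, labels) if l == k] for k in set(labels)}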

    • Gooblebrai 6 days ago |
      Interesting approach. Did you tell GPT to summarise the comments of each cluster after grouping them?
  • empiko 6 days ago |
    My hot take: embeddings are overrated. They are overfitted on word overlap, leading to both many false positives and false negatives. If you identify a specific problem with them ("I really want to match items like these, but it does not work"), it is almost impossible to fix them. I often see them being used inappropriately, by people who read about their magical properties, but didn't really care about evaluating their results.
    • cheevly 6 days ago |
      You can easily fix this using embedding arithmetic to build embedding classifiers.
    • nostrebored 6 days ago |
      "I really want to match items like these, but it does not work" is just a fine tuning problem.
      • empiko 5 days ago |
        Yes, in the sense that you could with an infinite appropriate dataset and compute. No, in the sense of what is practically achievable.
        • nostrebored 5 days ago |
          You don't need infinite data. You need ~100k samples. It's also not particularly expensive.
          • empiko 5 days ago |
            Dude, you are talking total nonsense. What does "100k samples" even mean? We have not even established what task we are talking about, and you already know that you need that many samples. Not to be offensive, but you seem like the type of guy who believes in these magical properties.
            • nostrebored 5 days ago |
              … you established the task earlier? Item X and Item Y should be colocated in embedding space.

              This is what people using embedding models for recsys are doing. It’s not rocket science and it doesn’t require “infinite data”.

              By 100k samples I mean 100k samples that provide relevance feedback. 100k positive pairs.

              I’m working on these kinds of problems with actual customers. Not really sure where the hostility is coming from.

    • deepsquirrelnet 6 days ago |
      I think there is a deeper technical truth to this that hints at how much space there is to be gained in optimization.

      1) that matryoshka representations work so well, and as few as 64 dimensions account for a large majority of the performance

      2) that dimensional collapse is observed. Look at your cosine similarity scores and be amazed that everything is pretty similar: despite the scale running from -1 to 1, almost nothing is ever less than 0.8 for most models

      I think we’re at the infancy in this technology, even with all of the advances in recent years.
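
      On (1), the practical upshot is that you can just truncate and renormalize (a sketch; only valid for models actually trained with matryoshka representation learning):

          import numpy as np

          # truncate a matryoshka-trained embedding and renormalize before cosine sim
          def truncate(embedding, dims=64):
              v = np.asarray(embedding)[:dims]
              return v / np.linalg.norm(v)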

  • mrob 6 days ago |
    Embeddings are the only aspect of modern AI I'm excited about because they're the only one that gives more power to humans instead of taking it away. They're the "bicycle for our minds" of Steve Jobs fame; intelligence amplification not intelligence replacement. IMO, the biggest improvement in computer usability in my lifetime was the introduction of fast and ubiquitous local search. I use Firefox's "Find in Page" feature probably 10 or more times per day. I use find and grep probably every day. When I read man pages or logs, I navigate by search. Git would be vastly less useful without git grep. Embeddings have the potential to solve the biggest weakness of search by giving us fuzzy search that's actually useful.
    • gwervc 6 days ago |
      I agree with this view. Generative AI robs us of something (thinking, practicing), namely the long-term ability to practice a skill and improve oneself, in exchange for an immediate (often crappy) result. Embeddings are a tech that can help us solve problems, but we still have to do most of the work.
      • wussboy 6 days ago |
        I’m not sure it robs us. It makes it possible, but many people including myself find the artistic products of AI to be utterly without value for the reasons you list. I will always cherish the product of lifelong dedication and human skill
        • jacobr1 6 days ago |
          It doesn't diminish - but I do find it interesting how it influences. Realism became less important, less interesting, though still valued to a lesser degree, with the ubiquity of photography. Where will human creativity move when certain tasks become trivially machine-replicable? Where will human ingenuity _enabled_ by new technology make new art possible?
      • larve 6 days ago |
        I ask LLMs to give me exercises, tutorials then write up my experience into "course notes", along with flashcards. I ask it to simulate a teacher, I ask it to simulate students that I have to teach, etc...

        I haven't found a tool that is more effective in helping me learn.

        • greentxt 6 days ago |
          Great for learning for learning's sake. Learning with the intention of pursuing a career requires the economic/job model too, which is the problem.
      • stocknoob 6 days ago |
        Does a player piano rob you of playing music yourself? A car from walking? A wheelbarrow from working out? It’s up to you if you want to stop practicing!

        Chess has become even more popular despite computers that can “rob us” of the joy. They’re even better practice partners.

        • crashabr 6 days ago |
          An individual car doesn't stop you from walking but a culture that centers cars leads to cities where walking is outright dangerous.

          Most car owners would never say outright "I want a car-centric culture". But car manufacturers lobbied for it, and step by step, we got both the deployment of useful car infrastructure, and the destruction or ignoring of all amenities useful for people walking or cycling.

          Now let's go back to the period when cars started to become enormously popular, and cities started to build neighborhoods without sidewalks. There was probably someone at the time complaining about the risk of cars overtaking walking and leading to stores being farther away, etc. And in front of them was probably someone like you, calling them a luddite and being oblivious to second-order effects.

          • stocknoob 5 days ago |
            Land is a shared, zero-sum resource: a parking lot is not a park.

            Your software development methodology is your own. Why does someone else’s use of a tool deprive you of doing things the way you want?

    • TeMPOraL 6 days ago |
      So you're saying, embeddings are fine, as long as we refrain from making full use of their capabilities? We've hit on a mathematical construct that seems to be able to capture understanding, and you're saying that the biggest models are too big, we need to scale down, only use embeddings for surface-level basic similarities?

      I too think embeddings are vastly underutilized, and chat interface is not the be-all, end-all (not to mention, "chat with your program/PDF/documentation" just sounds plain stupid). However, whether current AI tools are replacing or amplifying your intelligence, is entirely down to how you use them.

      As for search, yes, that was a huge breakthrough and powerful amplifier. 2+ decades ago. At this point it's computer use 101 - which makes it sad when dealing with programs or websites that are opaque to search, and "ubiquitous local search" is still not here. Embeddings can and hopefully will give us better fuzzy/semantic searching, but if you push this far enough, you'll have to stop and ask - if the search tool is now capable to understand some aspects of my data, why not surface this understanding as a different view into data, instead of just invoking it in the background when user makes a search query?

      • autokad 6 days ago |
        I have found that embeddings + LLM is very successful. I'm going to make up the words so as not to reveal my work publicly, but I had to classify something into 3 categories. I asked a simple LLM to label it; it was 95% accurate. Taking the min distance from the word embeddings to the mean category embeddings was about 96%. When I gave the LLM the embedding prediction, the LLM was 98% accurate.

        There were cases an embedding model might not do well on whereas the LLM could handle them. For example: these were camel-case words, like WoodPecker, AquafinaBottle, and WoodStock (I changed the words to not reveal private data). WoodPecker and WoodStock would end up with close embedding values because the word Wood dominated, but they were supposed to go into 2 different categories.
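
        The min-distance-to-category-mean step is simple enough to sketch (here `embed` and the labeled examples are stand-ins):

            import numpy as np

            # centroids: mean embedding per category, from a few labeled examples
            centroids = {cat: np.mean([embed(x) for x in examples], axis=0)
                         for cat, examples in labeled.items()}

            def classify(text):
                v = embed(text)
                return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))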

        • bravura 6 days ago |
          Some of the best performing embedding models (https://huggingface.co/spaces/mteb/leaderboard) are LLMs. Have you tried them?
        • kkielhofner 6 days ago |
          > word Wood dominated the embedding values, but these were supposed to go into 2 different categories

          When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.

          [0] - https://huggingface.co/atomic-canyon/fermi-bert-1024

          [1] - https://huggingface.co/atomic-canyon/fermi-1024

          • bravura 5 days ago |
            Do you mind sharing why you chose SPLADE-esque sparse embeddings?

            I have been working on embeddings for a while.

            For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?

            • kkielhofner 5 days ago |
              > Do you mind sharing why you chose SPLADE-esque sparse embeddings?

              I can provide what I can provide publicly. The first thing we ever do is develop benchmarks given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].

              When working with operating nuclear power plants there are some fairly unique challenges:

              1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...

              2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient especially in terms of RAM and storage. Especially important when factoring in budgetary requirements. We're already dropping in eight H100s (minimum) so it starts to creep up fast...

              3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.

              4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.

              So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.

              I should also add that some aspects of this (like pretrained BERT) are fairly compute-intense to train. Fortunately we work with the Department of Energy Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).

              [0] - https://huggingface.co/datasets/atomic-canyon/FermiBench

              [1] - https://en.wikipedia.org/wiki/Frontier_(supercomputer)

              • teleforce 3 days ago |
                This is an excellent comment; can someone put it in the highlights?
      • User23 5 days ago |
        > We've hit on a mathematical construct that seems to be able to capture understanding

        I’m admittedly unfamiliar with the space, but having just done some reading that doesn’t look to be true. Can you elaborate please and maybe point to some external support for such a bold claim?

        • TeMPOraL 4 days ago |
          > Can you elaborate please and maybe point to some external support for such a bold claim?

          SOTA LLMs?

          If you think about what, say, a chair or electricity or love are, or what it means for something to be something, etc., I believe you'll quickly realize that words and concepts don't have well-defined meanings. Rather, we define things in terms of other things, which themselves are defined in terms of other things, and so on. There's no atomic meaning, the meaning is in the relationships between the thought and other thoughts.

          And that is exactly what those models capture. They're trained by consuming a large amount of text - but not random text, real text - and they end up positioning tokens as points in high-dimensional space. As you increase the number of dimensions, there's eventually enough of them that any relationship between any two tokens (or groups; grouping concepts out of tokens is just another relationship) can be encoded in the latent space as proximity along some vector.

          You end up with a real computational artifact that implements the idea of defining concepts only in terms of other concepts. Now, between LLMs and the ability to identify and apply arbitrary concepts with vector math, I believe that's as close to the idea of "understanding" as we've ever come.

          • User23 4 days ago |
            That does sound a bit like Peircian semiotic so I’m with you so far as the general concept of meaning being a sort of iterative construct.

            Where I don’t follow is how a bitmap approximation captures that in a semiotic way. As far as I can tell the semiosis still is all occurring in the human observer of the machine’s output. The mathematics still hasn’t captured the interpretant so far as I can see.

            Regardless of my possible incomprehension, I appreciate your elucidation. Thanks!

      • mnky9800n 5 days ago |
        I feel like embeddings will be more powerful for understanding high-dimensional physics than language, because chaotic system predictability is limited by compressibility. An embedding captures exactly how compressible the system is, and can therefore extend the predictability as far as possible.
    • inbread 6 days ago |
      I've been experimenting with using embeddings for finding the relevant git commits, as I often don't know or remember the exact word that was used. So I created my own little tool for embedding and finding commits by commit messages. Maybe you'll also find it useful: https://github.com/adrianmfi/git-semantic-similarity
      • chamomeal 5 days ago |
        Very cool, I’ll try this out!
      • porridgeraisin 4 days ago |
        Nice! Let me try this out
    • mgraczyk 6 days ago |
      All modern AI technology can give more power to humans, you just have to use the right tools. Every AI tool I can think of has made me more productive.

      LLMs help me write code faster and understand new libraries, image generation helps me build sites and emails faster, etc

    • attentive 5 days ago |
      there is fzf, depending on your definition of "useful"
  • imgabe 6 days ago |
    Is there any benefit to fine-tuning a model on your corpus before using it to generate embeddings? Would that improve the quality of the matches?
    • gunalx 6 days ago |
      Yes. Especially if you work in a not-well-supported language and/or have specific data pairs you want to match that might be out-of-the-ordinary text.

      Training your own fine-tune takes very little time and GPU resources, and you can easily outperform even SOTA models on your specific problem with a smaller model/vector space.

      Then again, on general English text with a basic fuzzy search, I would not really expect high performance gains.

  • tomthe 6 days ago |
    Nice introduction, but I think that ranking the models purely by their input token limits is not a useful exercise. Looking at the MTEB leaderboard is better (although a lot of the models are probably overfitting to their test set).

    This is a good time to shill my visualization of 5 million embeddings of HN posts, users and comments: https://tomthe.github.io/hackmap/

    • kaycebasques 6 days ago |
      Thanks, a couple other people gave me this same feedback in another comment thread and it definitely makes sense not to overindex on input token size. Will update that section in a bit.
  • l5870uoo9y 6 days ago |
    Are there any visualization libraries that visualize embeddings in a vector space?
  • adamgordonbell 6 days ago |
    I was using embeddings to group articles by topic, and hit a specific issue. Say I had 10 articles about 3 topics, and articles are either dry or very casual in tone.

    I found clustering by topic was hard, because tone dimensions (whatever they were) seemed to dominate.

    How can you pull apart the embeddings? Maybe use an LLM to extract a topic, and then cluster by extracted topic?

    In the end I found it easier to just ask an LLM to group articles by topic.

    • eamag 6 days ago |
      I agree. I tried several methods during my pet project [1], and all of them have their pros and cons. It looks like creating topics first and then predicting them using an LLM works best.

      [1] https://eamag.me/2024/Automated-Paper-Classification

    • coredog64 6 days ago |
      Allegedly, the new hotness in RAG is exactly that. Use a smaller LLM to summarize the article and include that summary alongside the article when generating the embedding.

      Potentially solves your issue, but it is also handy when you have to chunk a larger document and would lose context from calculating the embedding just on the chunk.

  • joerick 6 days ago |
    The thing that puzzles me about embeddings is that they're so untargeted, they represent everything about the input string.

    Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations and I am only interested in the content of the discussion, not the tone of it.

    How could I derive an embedding that represents only content and not tone?

    • adamgordonbell 6 days ago |
      Agreed... the biggest problem with off-the-shelf embeddings I hit. Need a way to decompose embeddings.
    • johndough 6 days ago |
      You can do math with word embeddings. A famous example (which I now see has also been mentioned in the article) is to compute the "woman vector" by subtracting "man" from "woman". You can then add the "woman vector" to e.g. the "king" vector to obtain a vector which is somewhat close to "queen".

      To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but same content to compute a "style direction". Then when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style:

          query_without_style = query - dot(query, style_direction) * style_direction
      
      I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.

      Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.

      And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.
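
      A numpy version of the projection above, assuming you've collected `style_pairs` of embeddings with the same content in two styles (a sketch):

          import numpy as np

          # style_pairs: (casual, formal) embedding pairs with matching content
          deltas = [casual - formal for casual, formal in style_pairs]
          style_direction = np.mean(deltas, axis=0)
          style_direction /= np.linalg.norm(style_direction)

          def remove_style(query):
              q = query - np.dot(query, style_direction) * style_direction
              return q / np.linalg.norm(q)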

      • joerick 6 days ago |
        Interesting, but your hypothesis assumes that 'tone' is one-dimensional, that there is a single axis you can remove. I think tone is very multidimensional, I'd expect to be removing multiple 'directions' from the embedding.
        • johndough 6 days ago |
          You could of course compute multiple "tone" directions for every "tone" you can identify and subtract all of them. It might work better, but it will definitely be more work.
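
          If you do collect several, it's cleaner to orthonormalize them and remove the whole subspace at once (numpy sketch; `tone_directions` is a list of direction vectors):

              import numpy as np

              # orthonormalize the tone directions so each is removed exactly once
              Q, _ = np.linalg.qr(np.stack(tone_directions, axis=1))
              query_without_tone = query - Q @ (Q.T @ query)
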
        • jerf 6 days ago |
          I would say rather that the "standard example" is simplified, but it does capture an essential truth about the vectors. The surprise is not that the real world is complicated and nothing is simply expressible as a vector and that treating it as such doesn't 100% work in every way in every circumstance all of the time. That's obvious. Everyone who might work with embeddings gets it, and if they don't, they soon will. The surprise is that it does work as well as it does and does seem to be capturing more than a naive skepticism would expect.
        • mattnewton 6 days ago |
          No, I don’t think the author is saying one dimensional - the vectors are represented by magnitudes in almost all of the embedding dimensions.

          They are still a “direction” in the way that [0.5, 0.5] in x,y space is a 45 degree angle, and in that direction it has a magnitude of around 0.7

          So of course you could probably define some other vector space where many of the different labeled vectors are translated to magnitudes in the original embedding space, letting you do things like have a “tone” slider.

          • TeMPOraL 6 days ago |
            I think GP is saying that GGP assumes "tone" is one direction, in the sense there exists a vector V representing "tone direction", and you can scale "tone" independently by multiplying that vector with a scalar - hence, 1 dimension.

            I'd say this assumption is both right and wrong. Wrong, because it's unlikely there's a direction in embedding space corresponding to a platonic ideal of "tone". Right, because I suspect that, for sufficiently large embedding space (on the order of what goes into current LLMs), any continuous concept we can articulate will have a corresponding direction in the embedding space, that's roughly as sharp as our ability to precisely define the concept.

    • loa_in_ 6 days ago |
      They don't represent everything. In theory they do but in reality the choice of dimensions is a function of the model itself. It's unique to each model.
      • joerick 6 days ago |
        Yeah, 'everything' as in 'everything that the model cares about' :)
    • macNchz 6 days ago |
      Depends on the nature of the content you’re working with, but I’ve had some good results using an LLM during indexing to generate a search document by rephrasing the original text in a standardized way. Then you can search against the embeddings of that document, and perhaps boost based on keyword similarity to the original text.
      • joerick 6 days ago |
        Nice workaround. I just wish there was a less 'lossy' way to go about it!
        • jacobr1 6 days ago |
          Could you explicitly train a set of embeddings that performs that step in the process? For example, when computing the loss, you compare against the normalized text rather than the original. Or alternatively, do this as a fine-tuning step. Then you would have embeddings optimized for the characteristics you care about.
        • hobs 6 days ago |
          Normal full-text search stuff helps reduce the search space - e.g. lemmatization, stemming, and query simplification were all around way before LLMs.
      • mrshu 6 days ago |
        This is also often referred to as Hypothetical Document Embeddings (https://arxiv.org/abs/2212.10496).
        • adamgordonbell 6 days ago |
          Do you have examples of this? Please say more!
    • mrshu 6 days ago |
      Though not exactly what you are after, Contextual Document Embeddings (https://huggingface.co/jxm/cde-small-v1), which generate embeddings based on "surrounding context" might be of some interest.

      With 281M params it's also relatively small (at least for an embedding model) so one can play with it relatively easily.

    • nostrebored 6 days ago |
      There are a few things you can do. If these access patterns are well known ahead of time, you can train subdomain behavior into the embedding models by using prefixing. E.g. content: fixing a broken printer, tone: frustration about broken printer, and "fixing a broken printer" can all be served by a single model.

      We have customers doing this in production in other contexts.

      If you have fundamentally different access patterns (e.g. doc -> doc retrieval instead of query -> doc retrieval) then it's often time to just maintain another embedding index with a different model.
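
      The open E5 model family is one public example of this kind of prefixing, if anyone wants to see it concretely (a sketch; the prefixes are literal strings the model was trained with):

          from sentence_transformers import SentenceTransformer

          # E5 models are trained so that these literal prefixes select the task
          model = SentenceTransformer("intfloat/e5-base-v2")
          query_vec = model.encode("query: fixing a broken printer")
          doc_vecs = model.encode(["passage: Power-cycle the printer, then check ..."])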

    • _pastel 6 days ago |
      You could fine-tune the embedding model to reduce cosine distance on a more specific function.
    • magicalhippo 5 days ago |
      I've just begun to dabble with embeddings and LLMs, but recently I've been thinking about trying to use principal component analysis [1] to either project onto desirable subspaces, or project out undesirable subspaces.

      In your case it would be to take a bunch of texts which roughly mean the same thing but with variance in tone, compute the PCA of the normalized embeddings, take the top axis (or top few) and project it out (i.e. subtract the projection) of the embeddings for the documents you care about, before doing the cosine similarity.

      Something along those lines.

      Could be it's a terrible idea, haven't had time to do much with it yet due to work.

      [1]: https://en.wikipedia.org/wiki/Principal_component_analysis
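
      Something like this, perhaps (a sketch with sklearn; `paraphrase_embeddings` is the same-meaning, varied-tone texts, already embedded and normalized):

          import numpy as np
          from sklearn.decomposition import PCA

          # top principal axes of same-meaning/varied-tone texts ~ tone variation
          pca = PCA(n_components=3).fit(paraphrase_embeddings)
          V = pca.components_  # shape (3, dim); rows are orthonormal directions

          def project_out(v):
              v = v - V.T @ (V @ v)
              return v / np.linalg.norm(v)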

  • NameError 6 days ago |
    This article really resonates with me - I've heard people (and vector database companies) describe transformer embeddings + vector databases as primarily a solution for "memory/context for your chatbot, to mitigate hallucinations", which seems like a really specific (and kinda dubious, in my experience) use case for a really general tool.

    I've found all of the RAG applications I've tried to be pretty underwhelming, but semantic search itself (especially combined with full-text search) is very cool.

    • moffkalast 6 days ago |
      I dare say RAG with vector DBs is underwhelming because embeddings are not underrated but appropriately rated, and will not give you relevant info in every case. In fact, the way LLMs retrieve info internally [0] already works along the same principle and is a large factor in their unreliability.

      [0] https://nonint.com/2023/10/18/is-the-reversal-curse-a-genera...

  • dmezzetti 6 days ago |
    Author of txtai (https://github.com/neuml/txtai) here. I've been in the embeddings space since 2020 before the world of LLMs/GenAI.

    In principle, I agree with much of the sentiment here. Embeddings can get you pretty far. If the goal is to find information and citations/links, you can accomplish most of that with a simple embeddings/vector search.

    GenAI does have an upside in that it can distill and process those results into something more refined. One of the main production use cases is retrieval augmented generation (RAG). The "R" is usually a vector search but doesn't have to be.

    As we see with things like ChatGPT search and Perplexity, there is a push towards using LLMs to summarize the results while also linking to them to increase user confidence. Even Google Search now has that GenAI section at the top. In general, users just aren't going to accept LLM responses without source citations at this point. The question is whether the summary provides the value or whether the citations really provide the most value. If it's the latter, then embeddings will get the job done.

  • esafak 6 days ago |
    Underrated by people who are unfamiliar with machine learning, maybe.
    • vindex10 6 days ago |
      I actually tend to agree. In the article, I didn't see a strong argument highlighting exactly what powerful feature people were missing in relation to embeddings. Those who work in ML probably know these basics.

      It is a nice read though - explaining the basics of vector spaces, similarity and how it is used in modern ML applications.

      • kaycebasques 6 days ago |
        > Hopefully it's clear from the domain name and intro that I'm suggesting technical writers are underrating how useful embeddings can be in our work. I know ML practitioners do not underrate them.

        https://news.ycombinator.com/item?id=42014036

        > I didn't see the strong argument highlighting what powerful feature exactly people were missing in relation to embeddings

        I had to leave out specific applications as "an exercise for the reader" for various reasons. Long story short, embeddings provide a path to make progress on some of the fundamental problems of technical writing.

        • vindex10 4 days ago |
          Thank you for the explanation; yes, I later encountered your answer and upvoted it.

          > I had to leave out specific applications as "an exercise for the reader"

          This is very unfortunate. It would be very interesting to hear some intel :)

    • lokar 6 days ago |
      Even by ML people from 25 years ago. It's a black-box function that maps from a ~30k space to a ~1k space. It's a better function than things like PCA, but does the same thing.
    • kkielhofner 6 days ago |
      LLMs have nearly completely sucked the oxygen out of the room when it comes to machine learning or "AI".

      I'm shocked at the number of startups, etc you see trying to do RAG, etc that basically have no idea what they are, how they actually work, etc.

      The "R" in RAG stands for retrieval - as in the entire field of information retrieval. But let's ignore that and skip right to the "G" (generative)...

      Garbage in, garbage out people!

  • jonathanrmumm 6 days ago |
    Embeddings are a new jump to universality, like the alphabet or numbers. https://thebeginningofinfinity.xyz/Jump%20to%20Universality
    • OutOfHere 6 days ago |
      Mind-blowing. In effect, among humans, what separates the civilized from the crude is the quest for universality among the civilized. To say it differently, thinking in terms of attaining universality is the mark of a civilized mind.

      I made an episode to appreciate the book: https://podcasters.spotify.com/pod/show/podgenai/episodes/Th...

  • freediver 6 days ago |
    What would be really cool is if somebody figured out how to do embeddings -> text.
    • kabla 6 days ago |
      Is it not possible? I'm not that familiar with the topic. Doing some sort of averaging over a large corpus of separate texts could be interesting and probably would also have a lot of applications. Let's say that you are gathering feedback from a large group of people and want to summarize it in an anonymized way. I imagine you'd need embeddings with a somewhat large dimensionality though?
    • cubefox 6 days ago |
      I wonder if someone has already tried to do that. Though this might go in a similar direction: https://arxiv.org/abs/1711.00043
    • 0x1ceb00da 6 days ago |
      That's ChatGPT.
    • kaibee 6 days ago |
      Hmm as a very stupid first pass...

      0. Generate an embedding of some text, so that you have a known good embedding, this will be your target.

      1. Generate an array of random tokens the length of the response you want.

      2. Compute the embedding of this response.

      3. Pick a random sub-section of the response and randomize the tokens in it again.

      4. Compute the embedding of your new response.

      5. If the embeddings are closer together, keep your random changes, otherwise discard them, go back to step 2.

      6. Repeat this process until going back to step 2 stops improving your score. Also you'll probably want to shrink the size of the sub-section you're randomizing the closer your computed embedding is to your target embedding. Also you might be able to be cleverer by doing some kind of masking strategy? Like let's say the first half of your response text already was actually the true text of the target embedding. An ideal randomizer would see that randomizing the first half almost always makes the result worse, and so would target the 2nd half more often (I'm hoping that embeddings work like this?).

      7. Do this N times and use an LLM to score and discard the worst N-1 results. I expect that 99.9% of the time you're basically producing adversarial examples w/ this strategy.

      8. Feed this last result into an LLM and ask it to clean it up.
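
      As a toy sketch of steps 0-6 (assumes `embed_fn` returns unit-length vectors and `vocab` is a token list; pure random search, so expect adversarial junk):

          import numpy as np

          def invert(target, embed_fn, vocab, length=30, iters=10_000, seed=0):
              rng = np.random.default_rng(seed)
              tokens = list(rng.choice(vocab, size=length))
              best = float(np.dot(embed_fn(tokens), target))  # unit vectors assumed
              for _ in range(iters):
                  cand = list(tokens)
                  start = int(rng.integers(0, length))
                  width = max(1, int(length * (1 - best)))  # shrink as score improves
                  for j in range(start, min(start + width, length)):
                      cand[j] = rng.choice(vocab)
                  score = float(np.dot(embed_fn(cand), target))
                  if score > best:  # keep improvements, discard the rest (step 5)
                      tokens, best = cand, score
              return tokens, best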

  • ericholscher 6 days ago |
    This is a great post. I've also been having a lot of fun working with embeddings, with lots of those pages being documentation. We wrote up a quick post on how we're using them in prod, if you want to go from having an embedding to actually using it in a web app:

    https://www.ethicalads.io/blog/2024/04/using-embeddings-in-p...

    • kaycebasques 6 days ago |
      Thanks, Eric. So what you're really telling me is that you might make an exception to the "no tools talks" general policy for Write The Docs conference talks and let me nerd out on embeddings for 30 mins?? ;P
      • ericholscher 6 days ago |
        Haha. I think they are definitely relevant, and I’d call them a technology more than a tool.

        That is mostly just that we don’t want folks going up and doing a 30 minute demo of Sphinx or something :-)

  • huijzer 6 days ago |
    > Is it terrible for the environment?

    > I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text. But it also seems to be the case that embedding models are trained in similar ways as text generation models, with all the energy usage that implies. I’ll update this section when I find out more.

    Although I do care about the environment, this question is completely the wrong one if you ask me. There is, in public opinion (mainstream media?), some idea that we should use less AI and that this would somehow solve our climate problems.

    As a counterexample, let's go to the extreme. Let's ban Google Maps because it takes computational resources from the phone. As a result more people will take wrong routes, and thus use more petrol. Say you use one gallon of petrol extra; that wastes about 34 kWh, which is the equivalent of running 34 powerful vacuum cleaners on full power for an hour. In contrast, say you downloaded your map; then the total "cost" is only the power used by the phone. A mobile phone battery holds about 4 Ah, so 4 Ah * 4.2 V = 16.8 Wh, or about 0.017 kWh. Even if navigation drains the whole battery, the phone is roughly 2,000 times as efficient! And then we didn't even consider the time saved for the human.

    It's the same with running embeddings for doc generation. An Nvidia H100 consumes about 700 W, so roughly 0.7 kWh after an hour of running; that should be enough for a lot of embedding runs. If this then saves, for example, one workday including the drive to and from the office, then again the tradeoff is highly in favor of the compute.
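
    A back-of-the-envelope check of these estimates (all numbers rounded, and charging the phone's whole battery to navigation is a generous assumption):

      petrol_gallon_kwh = 34              # ~energy in one US gallon of petrol
      phone_battery_kwh = 4 * 4.2 / 1000  # 4 Ah * 4.2 V = 16.8 Wh ~ 0.017 kWh
      print(petrol_gallon_kwh / phone_battery_kwh)  # ~2000x in the phone's favor

      h100_kwh_per_hour = 700 / 1000      # 700 W for one hour ~ 0.7 kWh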

    • archerx 6 days ago |
      If people really cared about the environment, they would ban residential air conditioning; it’s a luxury.

      I lived somewhere that would get to 40°C in the summers and an oscillating fan was good enough to keep cool; the AC was nice to have but it wasn’t necessary.

      I find it very hypocritical when people tell you to change your lifestyle for climate change but they have the A/C blasting all day long, everyday.

      • wussboy 6 days ago |
        “Reduce” was never going to work. Only the deep electrification of our economy will save us.
        • TeMPOraL 5 days ago |
          Indeed, and A/C is kind of a prime example of why it's beneficial, given how much more energy-efficient heat pumps are for cooling and heating than just about anything else.
      • ausbah 6 days ago |
        In a somewhat ironic twist, AC is crucial survival infrastructure in some parts of the world during hot seasons: Phoenix (USA), parts of India, etc.
        • archerx 5 days ago |
          AC is never crucial; you just want to be comfortable at the expense of the planet. How did people live there for centuries before AC? Magic?
          • immibis 4 days ago |
            By not working when it's too hot to work, or by simply not living there, both of which are strictly verboten solutions in hypercapitalism.
      • coredog64 6 days ago |
        Prepare to have your mind blown: Heat pumps are very energy efficient at moving heat around. Significantly more so than the oil-fired boiler frequently found in the basement of a big-city apartment building.
      • BeetleB 6 days ago |
        > I lived somewhere that would get to 40°C in the summers and an oscillating fan was good enough to keep cool

        If it's not too humid.

        I've lived in places with relatively high humidity and 35-40C, and have had the misfortune of not having an AC. Fans are not enough. I mean, sure, you can survive, but it really, really sucks.

      • petesergeant 6 days ago |
        > that would get to 40°C in the summers and an oscillating fan was good enough to keep cool

        This is entirely meaningless without providing the humidity. At higher than 70% relative humidity, 40°C is potentially fatal.

        • archerx 5 days ago |
          It was pretty humid too, a fan was ok. Americans are just very weak and are babies when it comes to temperature.
      • TeMPOraL 5 days ago |
        > I lived somewhere that would get to 40°C in the summers and an oscillating fan was good enough to keep cool; the AC was nice to have but it wasn’t necessary

        As others have said, that works if you lived in a very dry area. And perhaps it was a house or a building optimized for airflow. And you didn't have much to do during the day. And I'm guessing you're young, and healthy.

        Here in central Europe, sustained 40°C in the summer would rack up a significant body count. Anything above 30°C sucks really bad if you have any work to do. Or if, say, you're pregnant or have small children. And you live in a city.

        Residential A/C isn't a luxury anymore; it's becoming a necessity very fast. Fortunately, heat pumps are one of the most efficient inventions of humankind. In particular, "A/C blasting all day long" beats any alternative way of escaping the heat that involves getting into a car. And it also beats whatever else you're doing to heat your place during the winter.

    • dleeftink 6 days ago |
      Long-term, it's not about barring progress, but about having progress and more energy-efficient models. The zero-sum games we play around energy usage don't necessarily hold in dynamic systems; the increased energy usage of generative models may well lead to fewer compute hours spent behind the desk drafting, revising, redrafting, and doing it all over again once the next project comes around.

      What remains, though, is that increased productivity has rarely led to a decrease in energy usage. Whether energy scarcity will drive model optimisation is anyone's guess, but it would be a differentiating feature in a market saturated with similarly capable offerings.

  • ggnore7452 6 days ago |
    If anything, I would consider embeddings a bit overrated, or at least it's safer to underrate them.

    They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They have only limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings get even fuzzier).

    Overly high expectations lead people to believe that embeddings will retrieve exactly what they mean, and with larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.

    • nostrebored 6 days ago |
      Off-the-shelf embedding models definitely overpromise and underdeliver. In ten years I'd be very surprised if companies in any competitive domain weren't fine-tuning embedding models for search based on their own data.
      • kkielhofner 6 days ago |
        My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].

        Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.

        [0] - https://huggingface.co/atomic-canyon/fermi-1024

    • kkielhofner 6 days ago |
      > they're not a complete replacement for simpler methods like BM25

      There are embedding approaches that balance "semantic understanding" with BM25-ish exact-term matching.

      They're still pretty obscure outside of the information retrieval space, but sparse embeddings[0] are the most widely used.

      [0] - https://zilliz.com/learn/sparse-and-dense-embeddings
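
      A toy illustration of the sparse side (not any specific library's API): a sparse embedding keeps learned weights for a handful of terms, so scoring reduces to a BM25-like sum over overlapping terms. Models like SPLADE learn these weights; the terms and numbers here are made up.

        sparse_doc = {"reactor": 1.9, "coolant": 1.2, "pump": 0.7}

        def sparse_dot(query, doc):
            # Only terms present in both query and doc contribute.
            return sum(w * doc.get(term, 0.0) for term, w in query.items())

        query = {"coolant": 1.5, "leak": 1.1}
        print(sparse_dot(query, sparse_doc))  # 1.5 * 1.2 = 1.8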

    • deepsquirrelnet 6 days ago |
      Absolutely. Embeddings have been around a while, and most people don’t realize it wasn’t until the e5 series of models from Microsoft that they even benchmarked as well as BM25 on retrieval scores, while being significantly more costly to compute.

      I think sparse retrieval with cross-encoders doing reranking is still significantly better than embeddings alone. Embedding indexes are also difficult to scale, since HNSW consumes too much memory above a few million vectors and IVF-PQ has issues with recall.
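
      A sketch of that rerank step, assuming the sentence-transformers CrossEncoder class and a public MS MARCO checkpoint (both exist; the query and candidates here are made up):

        from sentence_transformers import CrossEncoder

        # Retrieve candidates with BM25/sparse first, then rerank pairs.
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        query = "how do I rotate API keys?"
        candidates = ["Rotating keys in the dashboard", "Billing FAQ",
                      "Key rotation API reference"]
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(scores, candidates), reverse=True)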

  • mlinksva 6 days ago |
    https://technicalwriting.dev/data/embeddings.html#let-a-thou...

    > As docs site owners, I wonder if we should start freely providing embeddings for our content to anyone who wants them, via REST APIs or well-known URIs. Who knows what kinds of cool stuff our communities can build with this extra type of data about our docs?

    Interesting idea. You'd have to specify the exact embedding model used to generate an embedding, right? Is there a well-understood convention for such identification, like say model_name:model_version:model_hash? And for technical docs (obviously a very broad field), is there an embedding model, or a small number of them, widely used or obviously highly suitable, such that a site owner could choose one and have some reasonable expectation that publishing embeddings for their docs generated using that model would be useful to others? (Naive questions, I am not embedded in the field.)
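
    One purely hypothetical shape such a published payload could take; every field name here is an assumption, not an existing convention:

      payload = {
          "model": {
              "name": "text-embedding-3-large",  # example identifier
              "version": "2024-01",              # assumed versioning scheme
              "hash": "sha256:...",              # placeholder
              "dimensions": 3072,
          },
          "chunks": [
              {"url": "/docs/install.html#step-1",
               "embedding": [0.021, -0.134]},    # truncated for illustration
          ],
      }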

    • skybrian 6 days ago |
      It seems like sharing the text itself would be a better API, since it lets API users calculate their own embeddings easily. This is what the crawlers for search engines do. If they use embeddings internally, that’s up to them, and it doesn’t need to be baked into the protocol.
    • treefarmer 6 days ago |
      Yeah, this is the main issue with the suggestion. Embeddings can only be compared to each other if they are in the same space (e.g., generated by the same model). Providing embeddings of a specific kind would require users to use the same model, which can quickly become problematic if you're using a closed-source embedding model (like OpenAI's or Cohere's).
    • nkko 5 days ago |
      Could we work toward standardization at some point? Obviously there will always be a newer model; I just hate that all the embedding work I did was with a now-deprecated OpenAI model. At the least, individual providers should see an interest in ensuring this across their own model releases. A trick like matryoshka embeddings could let embeddings from newer models nest, or work, within the space of an older model, preserving some form of comparability or alignment.
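
      A sketch of the matryoshka idea: models trained with Matryoshka Representation Learning let you truncate an embedding to a prefix and renormalize, so smaller vectors nest inside larger ones. The dimensions below are illustrative.

        import numpy as np

        def truncate_embedding(vec, dims):
            prefix = np.asarray(vec[:dims], dtype=np.float32)
            return prefix / np.linalg.norm(prefix)  # renormalize for cosine

        full = np.random.randn(1024)           # stand-in for a model output
        small = truncate_embedding(full, 256)  # usable in the smaller space
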
  • OutOfHere 6 days ago |
    This article showed the incorrect value of 3072 for the OpenAI text-embedding-3-large Input Limit; 3072 is actually its output dimension [1]. The correct input limit is 8191 tokens [2].

    Edit: This value has now been fixed in the article.

    [1] https://platform.openai.com/docs/models/embeddings#embedding...

    [2] https://platform.openai.com/docs/guides/embeddings/#embeddin...

    Also, what each model counts as a token can differ a lot due to model-specific encodings, so ultimately one must compare the number of characters, not tokens.
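
    A quick way to see the characters-per-token ratio, using tiktoken (OpenAI's tokenizer library) as one example encoding; other models' tokenizers will give different ratios, which is the point:

      import tiktoken

      text = "Embeddings are underrated."
      enc = tiktoken.get_encoding("cl100k_base")
      tokens = enc.encode(text)
      print(len(text), len(tokens), len(text) / len(tokens))  # chars per token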

  • tootie 6 days ago |
    Is it accurate to say that any data that can be tokenized can be turned into embeddings?
  • eproxus 6 days ago |
    I wonder if this can be used to detect code similarity between, e.g., functions or files? Or are the existing models overly trained on written prose?
    • OutOfHere 6 days ago |
      Yes, of course it can be used in that way, but the quality of the result depends on whether the model was also trained on such code or not.
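
      A hedged sketch of the experiment, assuming sentence-transformers; the model name is illustrative, and a code-trained embedding model should do noticeably better:

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")  # prose-heavy model
        a = "def add(x, y): return x + y"
        b = "def sum_two(a, b): return a + b"
        print(util.cos_sim(model.encode(a), model.encode(b)))
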
  • hambandit 5 days ago |
    Embeddings from things like one-hot encoding, count vectorization, and tf-idf, fed into dimensionality-reduction techniques like SVD and PCA, have been around for a long time and also provided the ability to compare any two pieces of text. Yes, neural networks and LLMs let the context of each word affect the whole document's embedding and capture more meaning, potentially even that pesky "semantic" sort; but fundamentally they are still a dimensionality-reduction technique.
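
    That classic pipeline as a sketch: TF-IDF vectors reduced with truncated SVD (a.k.a. latent semantic analysis) give low-dimensional embeddings you can compare with cosine similarity. The example texts are made up.

      from sklearn.decomposition import TruncatedSVD
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = ["the reactor coolant pump", "coolant pump maintenance", "tax law"]
      X = TfidfVectorizer().fit_transform(docs)
      emb = TruncatedSVD(n_components=2).fit_transform(X)
      print(cosine_similarity(emb[:1], emb[1:]))  # doc 0 vs. docs 1 and 2
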
  • ABraidotti 5 days ago |
    This reminds me: I gotta go back and reread Borges's short stories with ML theory in mind.
  • luizsantana 5 days ago |
    Embeddings are indeed great. I have been using them a lot.

    Even wrote about it at: https://blog.dobror.com/2024/08/30/how-embeddings-make-your-...

  • 0x20cowboy 5 days ago |
    I have made several successful products in the past few years using primarily embeddings and cosine similarity. Can recommend. It’s amazingly effective (compared to what most people are using today anyway).
  • rgavuliak 4 days ago |
    The title of the post says they are underrated, but it doesn't provide any real justification beyond saying that they are good for X.

    I am not denying their usefulness, but the title is misleading.

  • _jonas 4 days ago |
    It's fun to try and guess what semantic concepts might be captured within individual dimensions / pairs of dimensions of the embedding space.