Faiss seems big to get going, and I tried n2 but it doesn't seem to want to install via pip... if anyone has a go-to I'd be grateful. Thanks.
Seems to me that unless most of the variation is in only a couple of directions, pretty much no points are going to be anywhere near one another.
So with cosine similarity you're either going to get low scores for pretty much everything, or else a basic PCA should be able to reduce the dimensionality significantly.
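To make that concrete, here's a rough sketch with numpy and scikit-learn (the data is synthetic, not from any real embedding model): random isotropic vectors have near-zero cosine similarity to one another, while data whose variance sits in a few directions is captured almost entirely by a handful of PCA components.

```python
# Synthetic illustration only: 768-dim vectors whose variance lives in ~10 directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 1,000 points in 768 dimensions, but most variance in only 10 directions.
low_rank = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 768))
X = low_rank + 0.05 * rng.normal(size=(1000, 768))

# Cosine similarity between random points of an isotropic cloud hovers near zero...
iso = rng.normal(size=(1000, 768))
iso /= np.linalg.norm(iso, axis=1, keepdims=True)
print("typical |cosine| for isotropic data:", np.abs(iso[:500] @ iso[500:].T).mean())

# ...while PCA on the structured data shows a few components explain almost everything.
pca = PCA(n_components=20).fit(X)
print("variance explained by first 10 components:",
      pca.explained_variance_ratio_[:10].sum())
```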
1. Real data rarely occupies the full high-dimensional space uniformly. Instead, it typically lies on or near a lower-dimensional manifold embedded within the high-dimensional space. This is often called the "manifold hypothesis."
2. While distances may be large in absolute terms, _relative_ distances still maintain meaningful relationships. If point A is closer to point B than to point C in this high-dimensional space, that proximity often still indicates semantic similarity.
3. The data points that matter for a given problem often cluster in meaningful ways. Even in high dimensions, these clusters can maintain separation that makes nearest neighbor search useful.
Let me give a concrete example: Consider a dataset of images. While an image might be represented in a very high-dimensional space (e.g., thousands of pixels), images of dogs will tend to be closer to other dog images than to images of cars, even in this high-dimensional space. The meaningful features create a structure that nearest neighbor search can exploit.
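A toy version of points 2 and 3, with synthetic clusters standing in for "dog" and "car" embeddings (no real images; the dimension, cluster count, and noise level are made up):

```python
# Even in 1,000 dimensions, nearest-neighbor lookups land in the right cluster
# because the clusters stay separated relative to their internal spread.
import numpy as np

rng = np.random.default_rng(1)
d = 1000
dog_center = rng.normal(size=d)
car_center = rng.normal(size=d)

dogs = dog_center + 0.5 * rng.normal(size=(500, d))
cars = car_center + 0.5 * rng.normal(size=(500, d))
X = np.vstack([dogs, cars])                 # first 500 rows "dogs", last 500 "cars"
labels = np.array([0] * 500 + [1] * 500)

query = dog_center + 0.5 * rng.normal(size=d)   # a new "dog"
dists = np.linalg.norm(X - query, axis=1)
nearest = np.argsort(dists)[:10]
print("labels of the 10 nearest neighbors:", labels[nearest])  # expect mostly 0s
```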
Spam filtering is another area where nearest neighbor is used to good effect. When you know that a certain embedding represents a spam message (in any medium: email, comments, whatever), and other messages come along that are _relatively_ close to it, you may conclude that they are on the right side of the manifold to be considered spam.
You could train a special model to define this manifold, but spam changes all the time and constant retraining doesn't work well.
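A minimal sketch of that idea, assuming you already have embeddings for messages confirmed as spam (the threshold and shapes here are placeholders, not a tuned system):

```python
import numpy as np

def is_probably_spam(message_vec, known_spam_vecs, threshold=0.9):
    """message_vec: (d,) embedding; known_spam_vecs: (n, d) embeddings of known spam."""
    a = message_vec / np.linalg.norm(message_vec)
    b = known_spam_vecs / np.linalg.norm(known_spam_vecs, axis=1, keepdims=True)
    # Flag the message if its highest cosine similarity to any known spam exceeds the threshold.
    return float(np.max(b @ a)) >= threshold

# Appending newly confirmed spam to known_spam_vecs updates the "model" instantly,
# which is the appeal versus retraining a classifier every time spam shifts.
```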
Since embeddings are the middle layer of an ANN, doesn't this suggest that too many dimensions are used during training? I would think a training goal would be relatively uniform coverage of the space.
The high dimensionality of embeddings provides several key advantages during training and inference. The additional dimensions allow the network to better preserve both local and global relationships between data points, while providing the capacity to capture subtle semantic nuances. This extra capacity helps prevent information bottlenecks and provides more pathways for gradient descent during training, leading to more stable optimization. In essence, the high-dimensional nature of embeddings, despite its counterintuitive properties, is a feature rather than a bug in neural network architecture.
TL;DR: High-dimensional embeddings in neural networks are actually beneficial - the extra dimensions help keep different concepts clearly separated and make training more stable, even though the distances between points become more uniform.
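If you want to see the "distances become more uniform" effect numerically, here's a quick synthetic check: for i.i.d. random points, the ratio between the farthest and nearest neighbor distance shrinks toward 1 as the dimension grows.

```python
# Rough illustration of distance concentration in high dimensions (synthetic data only).
import numpy as np

rng = np.random.default_rng(2)
for d in (2, 10, 100, 1000, 10000):
    X = rng.normal(size=(2000, d))
    q = rng.normal(size=d)
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:>6}  max/min distance ratio: {dists.max() / dists.min():.2f}")
```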
They are/can be, anyway. I had data with 50,000 dimensions; applying dimensionality reduction techniques got it "down to" around 300! ANN worked very well on those vectors. This was prior to the glut of vector DBs we have available now, so it was all in-memory, using direct library calls to find neighbors.
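For anyone wanting to reproduce that kind of pipeline today, here's roughly what it looks like with off-the-shelf libraries. The original data and tooling aren't specified, so this uses synthetic sparse data, TruncatedSVD, and scikit-learn's in-memory NearestNeighbors as stand-ins:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Pretend 50,000-dimensional raw vectors (sparse, as very wide feature data usually is).
X = sp.random(5000, 50_000, density=0.001, format="csr", random_state=3)

reduced = TruncatedSVD(n_components=300).fit_transform(X)   # "down to" ~300 dims

nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(reduced)
dists, ids = nn.kneighbors(reduced[:5])   # neighbors of the first five vectors
print(ids.shape)                          # (5, 10)
```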
Whether clustering works or not has nothing to do with the dimensionality of the space, and everything to do with the distribution of the points.
Use approximate NN search when you have a high volume of searches over millions of vectors.
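Circling back to the Faiss question upthread: a minimal approximate-NN setup isn't too bad. A sketch (the vector count, dimension, and HNSW parameter here are arbitrary):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384
xb = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(xb)                    # unit vectors, so L2 ranking == cosine ranking

index = faiss.IndexHNSWFlat(d, 32)        # 32 = HNSW graph connectivity (M)
index.add(xb)

xq = np.random.rand(5, d).astype("float32")
faiss.normalize_L2(xq)
distances, ids = index.search(xq, 10)     # 10 approximate nearest neighbors per query
print(ids.shape)                          # (5, 10)
```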
https://planetscale.com/blog/announcing-planetscale-vectors-...
I'm sure it works great, but at that price point, I'm stuck with self-hosting Postgres+pgvector.
But yes, it does seem extreme. It's also cheaper than hiring a dedicated Postgres/DB person, who will cost 5 to 10x more per month.
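For reference, the self-hosted route is pretty small to stand up. A minimal sketch, assuming a local database called "demo" with the pgvector extension installed; the table name, column name, and 3-dim vectors are placeholders:

```python
import psycopg  # pip install "psycopg[binary]"

with psycopg.connect("dbname=demo") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
    conn.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
    # HNSW index so the ORDER BY below is an approximate cosine-distance search
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_embedding_idx "
        "ON items USING hnsw (embedding vector_cosine_ops)"
    )
    rows = conn.execute(
        "SELECT id FROM items ORDER BY embedding <=> '[2,3,4]' LIMIT 5"
    ).fetchall()
    print(rows)
```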
The null hypothesis is that a given point (a vector in R^n, for the reals R and a positive integer n) comes from an independent, identically distributed set of random variables. The cute part was the use of the distributional property of tightness for the test statistic. The intended first application was monitoring computer systems, improving on the early AI expert systems.
https://www.microsoft.com/en-us/research/uploads/prod/2021/1...