Faiss seems big to get going, and I tried n2 but it doesn't seem to want to install via pip... if anyone has a go-to I'd be grateful. Thanks.
Seems to me that unless most of the variation is in only a couple of directions, pretty much no points are going to be anywhere near one another.
So with cosine similarity you're either going to get low scores for pretty much everything, or else a basic PCA should be able to reduce the dimensionality significantly.
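To make that concrete, here's a rough sketch with numpy and scikit-learn (the data is synthetic, not from any real embedding model): random isotropic vectors have near-zero cosine similarity to one another, while data whose variance sits in a few directions is captured almost entirely by a handful of PCA components.

```python
# Synthetic illustration only: 768-dim vectors whose variance lives in ~10 directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 1,000 points in 768 dimensions, but most variance in only 10 directions.
low_rank = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 768))
X = low_rank + 0.05 * rng.normal(size=(1000, 768))

# Cosine similarity between random points of an isotropic cloud hovers near zero...
iso = rng.normal(size=(1000, 768))
iso /= np.linalg.norm(iso, axis=1, keepdims=True)
print("typical |cosine| for isotropic data:", np.abs(iso[:500] @ iso[500:].T).mean())

# ...while PCA on the structured data shows a few components explain almost everything.
pca = PCA(n_components=20).fit(X)
print("variance explained by first 10 components:",
      pca.explained_variance_ratio_[:10].sum())
```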
1. Real data rarely occupies the full high-dimensional space uniformly. Instead, it typically lies on or near a lower-dimensional manifold embedded within the high-dimensional space. This is often called the "manifold hypothesis."
2. While distances may be large in absolute terms, _relative_ distances still maintain meaningful relationships. If point A is closer to point B than to point C in this high-dimensional space, that proximity often still indicates semantic similarity.
3. The data points that matter for a given problem often cluster in meaningful ways. Even in high dimensions, these clusters can maintain separation that makes nearest neighbor search useful.
Let me give a concrete example: Consider a dataset of images. While an image might be represented in a very high-dimensional space (e.g., thousands of pixels), images of dogs will tend to be closer to other dog images than to images of cars, even in this high-dimensional space. The meaningful features create a structure that nearest neighbor search can exploit.
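A toy version of points 2 and 3, with synthetic clusters standing in for "dog" and "car" embeddings (no real images; the dimension, cluster count, and noise level are made up):

```python
# Even in 1,000 dimensions, nearest-neighbor lookups land in the right cluster
# because the clusters stay separated relative to their internal spread.
import numpy as np

rng = np.random.default_rng(1)
d = 1000
dog_center = rng.normal(size=d)
car_center = rng.normal(size=d)

dogs = dog_center + 0.5 * rng.normal(size=(500, d))
cars = car_center + 0.5 * rng.normal(size=(500, d))
X = np.vstack([dogs, cars])                 # first 500 rows "dogs", last 500 "cars"
labels = np.array([0] * 500 + [1] * 500)

query = dog_center + 0.5 * rng.normal(size=d)   # a new "dog"
dists = np.linalg.norm(X - query, axis=1)
nearest = np.argsort(dists)[:10]
print("labels of the 10 nearest neighbors:", labels[nearest])  # expect mostly 0s
```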
Spam filtering is another area where nearest neighbor is used to good effect. When you know that a certain embedding represents a spam message (in any medium: email, comments, whatever), and other messages come along that are _relatively_ close to it, you may conclude that they are on the right side of the manifold to be considered spam.
You could train a special model to define this manifold, but spam changes all the time and constant retraining doesn't work well.
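A minimal sketch of that idea, assuming you already have embeddings for messages confirmed as spam (the threshold and shapes here are placeholders, not a tuned system):

```python
import numpy as np

def is_probably_spam(message_vec, known_spam_vecs, threshold=0.9):
    """message_vec: (d,) embedding; known_spam_vecs: (n, d) embeddings of known spam."""
    a = message_vec / np.linalg.norm(message_vec)
    b = known_spam_vecs / np.linalg.norm(known_spam_vecs, axis=1, keepdims=True)
    # Flag the message if its highest cosine similarity to any known spam exceeds the threshold.
    return float(np.max(b @ a)) >= threshold

# Appending newly confirmed spam to known_spam_vecs updates the "model" instantly,
# which is the appeal versus retraining a classifier every time spam shifts.
```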
Since embeddings are the middle layer of an ANN, doesn't this suggest that too many dimensions are used during training? I would think a training goal would be relatively uniform coverage of the space.
The high dimensionality of embeddings provides several key advantages during training and inference. The additional dimensions allow the network to better preserve both local and global relationships between data points, while providing the capacity to capture subtle semantic nuances. This extra capacity helps prevent information bottlenecks and provides more pathways for gradient descent during training, leading to more stable optimization. In essence, the high-dimensional nature of embeddings, despite its counterintuitive properties, is a feature rather than a bug in neural network architecture.
TL;DR: High-dimensional embeddings in neural networks are actually beneficial - the extra dimensions help keep different concepts clearly separated and make training more stable, even though the distances between points become more uniform.
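If you want to see the "distances become more uniform" effect numerically, here's a quick synthetic check: for i.i.d. random points, the ratio between the farthest and nearest neighbor distance shrinks toward 1 as the dimension grows.

```python
# Rough illustration of distance concentration in high dimensions (synthetic data only).
import numpy as np

rng = np.random.default_rng(2)
for d in (2, 10, 100, 1000, 10000):
    X = rng.normal(size=(2000, d))
    q = rng.normal(size=d)
    dists = np.linalg.norm(X - q, axis=1)
    print(f"d={d:>6}  max/min distance ratio: {dists.max() / dists.min():.2f}")
```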
They are/can be, anyway. I had data with 50,000 dimensions; applying dimensionality reduction techniques got it "down to" around 300! ANN worked very well on those vectors. This was prior to the glut of vector DBs we have available now, so it was all in-memory, using direct library calls to find neighbors.
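For anyone wanting to reproduce that kind of pipeline today, here's roughly what it looks like with off-the-shelf libraries. The original data and tooling aren't specified, so this uses synthetic sparse data, TruncatedSVD, and scikit-learn's in-memory NearestNeighbors as stand-ins:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Pretend 50,000-dimensional raw vectors (sparse, as very wide feature data usually is).
X = sp.random(5000, 50_000, density=0.001, format="csr", random_state=3)

reduced = TruncatedSVD(n_components=300).fit_transform(X)   # "down to" ~300 dims

nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(reduced)
dists, ids = nn.kneighbors(reduced[:5])   # neighbors of the first five vectors
print(ids.shape)                          # (5, 10)
```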
Whether clustering works or not has nothing to do with the dimensionality of the space, and everything to do with the distribution of the points.
Use approximate NN search when you have a high volume of searches over millions of vectors.
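Circling back to the Faiss question upthread: a minimal approximate-NN setup isn't too bad. A sketch (the vector count, dimension, and HNSW parameter here are arbitrary):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384
xb = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(xb)                    # unit vectors, so L2 ranking == cosine ranking

index = faiss.IndexHNSWFlat(d, 32)        # 32 = HNSW graph connectivity (M)
index.add(xb)

xq = np.random.rand(5, d).astype("float32")
faiss.normalize_L2(xq)
distances, ids = index.search(xq, 10)     # 10 approximate nearest neighbors per query
print(ids.shape)                          # (5, 10)
```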
https://planetscale.com/blog/announcing-planetscale-vectors-...
I'm sure it works great, but at that price point, I'm stuck with self-hosting Postgres+pgvector.
But yes, it does seem extreme. It's also cheaper than hiring a dedicated Postgres/DB person, who will cost 5 to 10x more per month.
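For reference, the self-hosted route is pretty small to stand up. A minimal sketch, assuming a local database called "demo" with the pgvector extension installed; the table name, column name, and 3-dim vectors are placeholders:

```python
import psycopg  # pip install "psycopg[binary]"

with psycopg.connect("dbname=demo") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
    conn.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
    # HNSW index so the ORDER BY below is an approximate cosine-distance search
    conn.execute(
        "CREATE INDEX IF NOT EXISTS items_embedding_idx "
        "ON items USING hnsw (embedding vector_cosine_ops)"
    )
    rows = conn.execute(
        "SELECT id FROM items ORDER BY embedding <=> '[2,3,4]' LIMIT 5"
    ).fetchall()
    print(rows)
```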
The null hypothesis is that a given point (a vector in R^n, for the reals R and a positive integer n) comes from an independent, identically distributed set of random variables. The cute part was the use of the distributional property of tightness for the test statistic. The intended first application was monitoring computer systems, improving on the early AI expert systems.
https://www.microsoft.com/en-us/research/uploads/prod/2021/1...