In my case, I was trying to perform synchronization over a set of noisy point clouds to extract the "ground truth" point cloud. I.e., take the set of coordinates corresponding to a point cloud, randomly permute the order of the coordinates, randomly rotate/reflect the point cloud, and then apply a Gaussian perturbation to each coordinate. Repeat this process multiple times. The goal then is to recover the original point cloud from the noisy rotated/reflected/permuted copies of the original.
Boumal and Singer had done some work to solve this problem for just rotations/reflections by essentially "stacking" the unknown orthogonal matrices (i.e., elements of O(3)) into a Stiefel manifold and then performing Riemannian optimization over this manifold. It worked fantastically well and was much faster than prior techniques involving semidefinite programming, so I decided to try the same approach for synchronization over permuted copies of the point cloud, and it also worked quite well. However, one can achieve better convergence properties by performing a discrete Fourier transform over the symmetric group on the sets of point clouds so that instead of optimizing over a Stiefel manifold that is supposed to represent stacked permutation matrices (well, stacked orthogonal matrices with constraints to make them doubly stochastic), you optimize over a manifold of stacked unitary matrices intended to represent irreps of S_n.
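(To make the setup concrete: the sketch below is not Boumal and Singer's Riemannian method, just a naive numpy baseline under the same data model, assuming correspondences are known, i.e. no permutation. All the numbers are made up. Each noisy copy is aligned to a reference via orthogonal Procrustes and then averaged.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 3, 20
truth = rng.normal(size=(n, d))

# k noisy copies, each transformed by a random element of O(3).
copies = []
for _ in range(k):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
    copies.append(truth @ Q.T + 0.01 * rng.normal(size=(n, d)))

# Orthogonal Procrustes: align every copy to the first one, then average.
# argmin_R ||Y R - ref||_F over orthogonal R is R = U V^T, where
# Y^T ref = U S V^T is an SVD.
ref = copies[0]
aligned = []
for Y in copies:
    U, _, Vt = np.linalg.svd(Y.T @ ref)
    aligned.append(Y @ (U @ Vt))
estimate = np.mean(aligned, axis=0)
```

Averaging the aligned copies shrinks the Gaussian noise; the estimate recovers the ground truth up to the (unknowable) global orthogonal transform of the reference frame. The hard part the comment describes is doing this jointly with unknown permutations, which this baseline sidesteps entirely.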
In the end I didn't know enough group theory at the time to figure out how to generate irreps for the "combined" group of O(3) and S_n, so I couldn't solve the problem simultaneously for both rotations/reflections and permutations. Singer and Boumal later developed an approach that bypassed the synchronization problem altogether by using invariant polynomials to extract the ground-truth signal in a more direct way.
(My group theory background is only from abstract algebra, quantum, and a bit of poking around)
https://press.princeton.edu/absil
I was unaware of the Boumal work, so thanks for that. Do you have any idea how Boumal's approach differs from Absil's?

For others, it looks like Boumal also has a book on the topic from 2023:
When my supervisor read about FedMA using matching, he went "oh god, I have terrible memories about it". He was right.
Has this been applied to networks like DETR for object detection? I’ve never been clear on how they made the Hungarian algorithm differentiable (and it seems like their approach has issues).
In math programming, we create a relaxation of the assignment problem where all of the integer decisions are converted to continuous ones. Then you solve this super simple Linear Program, which gives a theoretical lower bound on the optimal assignment. You can then start a branch-and-bound search to find the optimal integer assignment.
Now the trick is that over the decades we have identified cuts (inequalities) that you can apply to this initial relaxed problem: they chop off irrelevant continuous space while leaving the integer solutions untouched. If your cuts are good enough and the integer optimal solution sits at a vertex of the chopped polytope, congratulations! You don't have to search at all!
Moreover, well-known algorithms for linear programming like simplex and primal-dual methods have straightforward (and illuminating!) combinatorial interpretations when restricted to minimum-weight bipartite matching.
If you have any resources illustrating this point, please share. I would be very interested.
I liked this video [0] about how the Hungarian Algorithm is a particular case of the primal-dual method and I am eager to dig further.
E.g. in https://arxiv.org/abs/1905.11885 "Differentiable ranks and sorting using optimal transport" they show how a relaxation of an array sorting procedure leads to differentiable rank statistics (quantiles etc.). The underlying theory is quite close to what I show in the blog post.
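(A toy sketch of the idea, not the paper's exact construction: entropy-regularized optimal transport via Sinkhorn iterations yields a near-doubly-stochastic "soft permutation" whose rows give differentiable ranks. The values, anchors, and epsilon below are made-up illustration choices.)

```python
import numpy as np

def sinkhorn(K, n_iter=200):
    # Alternately normalize rows and columns; converges toward a
    # doubly stochastic matrix (Sinkhorn's theorem).
    for _ in range(n_iter):
        K = K / K.sum(axis=1, keepdims=True)
        K = K / K.sum(axis=0, keepdims=True)
    return K

# Soft sorting: transport cost between the values and sorted anchor positions.
x = np.array([0.3, -1.2, 2.5, 0.9])
anchors = np.sort(x)                  # in practice, fixed target positions
C = (x[:, None] - anchors[None, :]) ** 2
eps = 0.05                            # small eps -> nearly hard ranks
P = sinkhorn(np.exp(-C / eps))
soft_ranks = P @ np.arange(len(x))    # each entry: expected rank of x_i
```

As eps shrinks, P approaches the permutation matrix that sorts x, and soft_ranks approach the integer ranks; for larger eps the ranks blur, which is exactly what makes them differentiable.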
In DETR they don't actually use a differentiable Hungarian algorithm, it's only used to score their neural predictions outside of the training loop IIUC.
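(Right: the matching itself sits outside the gradient path; gradients flow only through the losses computed on the matched pairs. A minimal sketch with scipy's Hungarian solver, using a made-up prediction/ground-truth cost matrix.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matching cost between 3 predictions (rows) and
# 3 ground-truth objects (columns); lower is a better match.
cost = np.array([[0.9, 0.2, 0.7],
                 [0.1, 0.8, 0.6],
                 [0.5, 0.4, 0.3]])

# Hungarian algorithm: minimum-cost one-to-one assignment.
rows, cols = linear_sum_assignment(cost)
total = cost[rows, cols].sum()
```

The returned index pairs just select which (prediction, target) pairs the usual differentiable losses are computed on; no gradient ever passes through linear_sum_assignment itself.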
E.g., graph layouts are notoriously hard to glean any good information from, especially for larger graphs. Toss in a few constraints (enforce a topological sort, fill the whole page, keep nodes from overlapping, reduce crossings, have more or fewer clusters, ...), and you can easily adjust a few sliders and get the perfect layout for a given problem.
I think this would be easy to express in that, and I wonder how the performance would compare. Basically, there can be DSL-level facilities to track that functions and constraints are convex by construction, and remove the obligation to implement various bits of optimization logic (and allow one to switch between optimization details at a config level).
min sum_i sum_j c_{ij} * x_{ij}
s.t. sum_i x_{ij} == 1, for all j
     sum_j x_{ij} == 1, for all i
     x_{ij} >= 0

and you automagically get an integer solution in the x_{ij} due to the structure of the constraints (the constraint matrix is totally unimodular, so the vertices of the feasible polytope are integral). CVXPY would be a good way to implement this these days.
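(A runnable sketch of that LP relaxation, using scipy's linprog rather than CVXPY to keep dependencies light; the 3x3 cost matrix is made up and chosen to have a unique optimum, so any LP solver lands on the integral vertex.)

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([[4., 1., 3.],
              [2., 0., 5.],
              [3., 2., 2.]])
n = c.shape[0]

# Equality constraints: each row of x sums to 1, each column sums to 1.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # row-sum constraint for row i
    A_eq[n + i, i::n] = 1.0            # column-sum constraint for column i
b_eq = np.ones(2 * n)

res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
x = res.x.reshape(n, n)                # comes out 0/1: an assignment
```

Despite never asking for integrality, x is a permutation matrix, because the constraint matrix is totally unimodular and the optimum here is unique.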
I thought the formulas for updates in this case were pretty opaque, so let's think of a simpler example. Say we want to optimize a function f(x_1, ..., x_n) over the simplex
\Delta = \{ (x_1, ..., x_n) : x_1 + ... + x_n = 1, x_i >= 0 \}

We can imagine that we're allocating a limited resource into n different bins. At each optimization step, say we have numbers g_i telling us the partials of f with respect to the x_i. (Somewhat more generally, these numbers could tell us how much we "regret" having allocated our resource into the ith bin.) How can we use this information to change our choices of x_i?
One option would be to take vanilla gradient descent steps. That is, we'd choose some small eta, set x_i := x_i - eta g_i, and somehow adjust these assignments to be a valid allocation. Another method, which tends to be more natural and do better in practice, is to set x_i := x_i exp(-eta g_i) and then divide each x_i by (x_1 + ... + x_n). (This is called the Hedge method, and "mysteriously" also looks just like a Bayesian update for a distribution of belief over n possibilities.) This is the flavor of method being used in the post to optimize over doubly stochastic matrices.
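(The Hedge update in a few lines of numpy, on a made-up toy problem: a linear loss f(x) = c . x over the simplex, whose minimum puts all mass on the smallest c_i.)

```python
import numpy as np

def hedge_step(x, g, eta):
    # Multiplicative-weights update, then renormalize onto the simplex.
    x = x * np.exp(-eta * g)
    return x / x.sum()

c = np.array([3.0, 1.0, 2.0])      # per-bin "regret" rates (hypothetical)
x = np.full(3, 1.0 / 3.0)          # start from the uniform allocation
for _ in range(100):
    x = hedge_step(x, c, eta=0.5)  # gradient of f(x) = c . x is just c
```

After a few dozen steps nearly all mass sits on bin 1 (the smallest c_i), and the iterates stay on the simplex for free: no projection or clipping step is ever needed, which is the practical appeal over vanilla gradient steps.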
I've studied differential geometry but still have the impression that the proximal operator/mirror descent point of view is a little more natural in optimization/learning. Gesticulating wildly about Levi-Civita connections is fun, but e.g. to optimize over the simplex it makes more sense to just ask: how should we regularize our updates to not be too "imprudent"? (Of course this depends on what problem you're solving!) One common story is that you want to avoid neglecting bins until you're very sure they're worthless. To do this, you can use a metric that makes your manifold look larger as you approach a vertex of the simplex (and then make sure you can do the necessary computations and find a good retraction map!), or you can think about what proximal operator/Bregman divergence to use to regularize your updates. The paper this post builds on uses the Fisher metric, which is the infinitesimal version of the KL divergence, and regularizing your updates by KL gives you things like the Hedge method/Bayesian updates.