In practice, I've personally run some benchmarks on a collection of datasets I had lying around. The results were generally abysmal, with the method only matching simple baselines on a few datasets.
Finally, the original paper is very weird, and reads more like a marketing piece. The theory, which is touted throughout the paper, is very weak, the actual algorithm is not sufficiently well explained there, and the experiments are lacking. In particular, I find it telling that they do not include, and even go out of their way to ignore, important baselines such as boosted trees, which are the state-of-the-art solution to the problem that they intended to solve (and even work very well in cases where they claim that both KANs and MLPs perform badly, e.g. in high dimensions).
Only one follow-up question:
> I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering)
Much of how we introduce inductive biases in the traditional network setting (where activations are on the nodes instead of on the edges, as in KANs) is by using graph-based architectures, like convolutions or transformers, or by setting up particular losses and optimizations, as in equivariant networks. Can't we do the same thing for KANs?
If you want UQ, 'frequentist nonparametric' approaches like Conformal Prediction and Calibration/Multi-Calibration methods seem to work quite well (especially when combined with the standard ML machinery of taking a log-likelihood as your loss), and do not suffer from any of the issues above while also giving you formal guarantees of correctness. They are a strict improvement over Bayesian NNs, IMO.
It seems like the main time they aren't a strict improvement over bayesian methods is when it is difficult to define your calibration set? I know this scenario isn't so commonplace, but I'm working in a scenario where I quickly looked at conformal learning and wasn't sure if it is applicable.
Making a calibration set is pretty easy, it's just a data split (just like the train/test split). The hardest part (which is still fairly easy) is creating a 'conformity score', which is a function that receives the input and a candidate output and scores how well this candidate output 'conforms' to the input. This is where an underlying ML model can come in handy: it can, itself, estimate this! Split Conformal Prediction then does a fairly simple quantile calculation on these scores (or some variant thereof) to then form the set prediction.
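To make the recipe above concrete, here is a minimal sketch of Split Conformal Prediction for regression. It is only an illustration under assumptions that are not in the comment: the score is the absolute residual (strictly a nonconformity score, where larger means worse), `model` is a hypothetical already-trained predictor, and the data is synthetic.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the score to use
    if k > n:
        return math.inf  # too few calibration points: the set must be everything
    return sorted(scores)[k - 1]

def model(x):
    # Hypothetical already-trained point predictor (stand-in for any ML model).
    return 2.0 * x

def sample():
    # Synthetic data: the true relationship is y = 2x plus Gaussian noise.
    x = random.uniform(0.0, 1.0)
    return x, 2.0 * x + random.gauss(0.0, 0.1)

random.seed(0)
calib = [sample() for _ in range(500)]   # held-out split, never used for training
test = [sample() for _ in range(2000)]

alpha = 0.1
# Score each calibration pair by how badly the output disagrees with the model.
scores = [abs(y - model(x)) for x, y in calib]
q = conformal_quantile(scores, alpha)

# The prediction set for a new x is the interval [model(x) - q, model(x) + q].
covered = sum(model(x) - q <= y <= model(x) + q for x, y in test)
coverage = covered / len(test)
print(f"half-width={q:.3f}, empirical coverage={coverage:.3f}")
```

The only conformal-specific step is `conformal_quantile`; swapping in a different score function (e.g. one built from a classifier's logits) leaves the rest unchanged.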
In a sense, you could use Bayesian NNs to produce a conformity score. But that doesn't seem to be much better than just using e.g. the model's logits for your conformity score. Theory-wise, Conformal Prediction methods have a number of favorable guarantees that Bayesian models (and especially Bayesian NNs) generally don't, and in practice we've seen that conditional on the model giving calibrated outputs (which is guaranteed for Conformal Prediction, but not for Bayesian NNs), Conformal Prediction predicted sets seem to be tighter than the Bayesian NN ones.
Choosing a prior is hard, but I'd say it's about as hard as choosing an architecture - if all else fails, you can do a brute force search, and you even have the marginal likelihood to guide you. I don't think it's the main reason why people don't use BNNs much.
It of course makes the problem even more complex and likely requires further approximations to computing the posterior (or even the MAP solution).
This stretches the notion that you are still doing Bayesian reasoning but can still lead to useful insights.
(I'm a moderate who uses both approaches, seeing them as part of a general hierarchical modeling method, which means I get mocked by either side for lack of purity).
Bayesians are losing ground at the moment because their computational methods haven't been advanced as fast by the GPU revolution for reasons having to do with difficulty in parallelization, but there's serious practical work (especially using JAX) to catch up, and the whole normalizing flow literature might just get us past the limitations of MCMC for hard problems.
But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Calibration is also a pretty magical way to improve just about any estimator. It's cheap to do and it works (although hard to guarantee anything with that in the general case...)
And don't forget quantile regression penalties! Awkward to apply in the NN setting, but an easy and effective way to do UQ in XGBoost world.
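For reference, the "quantile regression penalty" is usually the pinball (quantile) loss. The sketch below is a plain-Python illustration, not any library's implementation; the function name and the toy data are my own. It shows the loss and checks that the constant minimizing it is an empirical quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

# Toy check: among constant predictions, the loss is minimized at a
# 0.9-quantile of the data (any value between 90 and 91 for 1..100).
data = [float(v) for v in range(1, 101)]
best = min(data, key=lambda c: pinball_loss(data, [c] * len(data), 0.9))
print(best)
```

Training one model at q = 0.05 and another at q = 0.95 gives a rough 90% interval, though without the coverage guarantee a conformal wrapper would add.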
> But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Many of these things can actually work really well with Conformal Prediction, but the algorithms require extensions (much like if you are doing Bayesian inference, you also need to update your model accordingly!). They generally end up being some form of reweighting to compensate for the distribution shifts (excluding the Online Conformal Prediction literature, which is another beast entirely). Also, it's worth noting that if you have iid data then Conformal Prediction is remarkably data-efficient; as few as 20 samples are enough for it to start working for 95% predictive intervals, and with 50 samples (and with almost surely unique conformity scores) it's going to match 95% coverage fairly tightly.
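The sample-size figures above can be checked with a little arithmetic. This sketch assumes the standard split-conformal setup (exchangeable data, no ties among scores), where the interval is finite only once ceil((n + 1)(1 - alpha)) <= n and the marginal coverage is exactly ceil((n + 1)(1 - alpha)) / (n + 1). The function names are mine, and `Fraction` is used only to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction

def min_calibration_size(alpha):
    """Smallest n for which the conformal quantile is finite: n >= 1/alpha - 1."""
    a = Fraction(str(alpha))
    return math.ceil(1 / a - 1)

def marginal_coverage(n, alpha):
    """Exact marginal coverage with n calibration points and no tied scores."""
    a = Fraction(str(alpha))
    k = math.ceil((n + 1) * (1 - a))
    if k > n:
        return Fraction(1)  # the prediction set is the whole output space
    return Fraction(k, n + 1)

print(min_calibration_size(0.05))            # smallest usable n for 95% intervals
for n in (20, 50, 500):
    cov = marginal_coverage(n, 0.05)
    print(n, cov, float(cov))                # always within 1/(n + 1) of 0.95
```

For alpha = 0.05 the minimum works out to 19 calibration points, in line with the "about 20" quoted above.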
This has been my view for a while now. Is this not correct?
In general, I think the idea of a big "frequentist vs Bayesian" debate is silly. I think it is very useful to take frequentist ideas and see what they look like from a Bayesian point of view, and vice versa (when applicable). I think this is pretty much the general stance among most people in the field - it's generally expected that one will understand that regularization methods equate to certain priors, for instance, and in general be able to relate these two perspectives as much as possible.
I agree that, computationally, it is hard to justify the use of Bayesian methods on large-scale neural networks when stochastic gradient descent (and friends) is so damn efficient and effective.
On the other hand, the fact that there's a dependence on (subjective) priors is hardly a fair critique: non-Bayesian training of neural networks also depends on the use of (subjective) loss functions with (subjective) regularization terms (in fact, it can be shown that, mathematically, MAP estimation under a prior is precisely equivalent to adding a regularization term to a loss function). Non-Bayesian training of neural networks is not "a failed approach" just because someone can arbitrarily choose L1 regularization (i.e., a Laplace prior) over L2 regularization (i.e., a Gaussian prior).
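A small numerical check of the prior/regularization correspondence, as a sketch for the simplest case I could think of: a single scalar weight, Gaussian noise with standard deviation `sigma`, and a Gaussian prior with standard deviation `tau` (all names and values here are illustrative assumptions). The L2-penalized least-squares solution with `lam = sigma**2 / tau**2` should be the MAP estimate, i.e. the gradient of the negative log-posterior should vanish there.

```python
import random

def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a scalar weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def neg_log_posterior_grad(w, xs, ys, sigma, tau):
    """d/dw of -log p(w | data) for y ~ N(w*x, sigma^2) and w ~ N(0, tau^2)."""
    grad_likelihood = -sum((y - w * x) * x for x, y in zip(xs, ys)) / sigma**2
    grad_prior = w / tau**2
    return grad_likelihood + grad_prior

random.seed(1)
sigma, tau = 0.5, 2.0
xs = [random.uniform(-1.0, 1.0) for _ in range(50)]
ys = [1.5 * x + random.gauss(0.0, sigma) for x in xs]

w_ridge = ridge_weight(xs, ys, lam=sigma**2 / tau**2)
grad_at_ridge = neg_log_posterior_grad(w_ridge, xs, ys, sigma, tau)
print(w_ridge, grad_at_ridge)  # the gradient should be numerically zero
```

Note that this equivalence concerns the MAP point estimate only; a full posterior carries more information than the regularized optimum.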
Furthermore, we do have some intuition over NN parameters (particularly when inputs and outputs are properly scaled): a value of 10^15 should be less likely than a value of 0. Note that, in Bayesian practice, people often use weakly-informative priors (see, e.g., http://www.stat.columbia.edu/~gelman/presentations/weakprior...) to encode such intuitive statements while ensuring that (for all practical purposes) the data will effectively overwhelm the prior (again, this is equivalent to adding a minimal amount of regularization to a loss function, to make a problem well-posed when e.g. you have more parameters than data points).
A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more. So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.
(Though if you have a reference for why empirical Bayes does give theoretical guarantees, I'll be happy to change my mind!)
I'd argue the choice is still subjective, since you are still only testing over a limited (subjective) set of options. If you are doing this properly (i.e., using an independent validation set), then you can apply the same approach to a Bayesian method and obtain the same type of information ("when I use prior A vs. prior B, how does that change the generalization/out-of-bag error properties of my model?"), without violating any properties or theoretical guarantees of "Bayesianism".
> A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more.
If you subjectively define a set of possible priors (i.e., distributions and parameters) to test in a validation setting, then you are not picking your prior based on the data (again, assuming that you have set up a leakage-free partition of your data in training and validation data), and you are not doing empirical Bayes, so you are not violating any supposed "principled quantification of uncertainty" (if you believe that applying a standard subjective Bayesian approach provides you with "principled quantification of uncertainty").
My point was that, in practice, there are ways of choosing (subjective) priors such that they provide sufficient regularization while ensuring that their impact on the results is minimized, particularly when you can assume certain things about the scale of data (and, in the context of neural networks, you often can, due to things like "normalization layers" and prior scaling of inputs and outputs): "subjective" doesn't have to mean "arbitrary".
> So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.
I won't argue about the fact that training NNs using Bayesian approaches is computationally unwieldy. I just don't see how evaluating a modelling decision (be it in Bayesian or non-Bayesian modelling), using a proper validation process, would violate any specific theoretical guarantees.
If you can explain to me how evaluating the generalization properties of a Bayesian training recipe on an independent dataset violates any specific theoretical guarantees, I would be thankful (note: as far as I am concerned, "principled quantification of uncertainty" is not a specific theoretical guarantee).
The important thing in bayesian (statistical, ML) modelling in general is the gain in flexibility: the ability to build model structures that would otherwise be hard or impossible, such as latent states, hierarchies, etc.
In bayesian NNs the main advantages would be uncertainty quantification (UQ), finding good optima, and, partly, avoiding overfitting. These do apply in some cases of simple NNs.
Mostly however, especially with larger conventional models (not speaking of normalizing flows and such here), using explicit bayes is not feasible. Instead, people use approximate point estimates with tricks:
(1) UQ has been taken care of by post-calibration. (2) Stochastic gradient actually searches for large posterior masses like a variational approximation would do, so it is kind of bayes. (3) And those priors: using dropout is commonplace, it has a bayesian interpretation, and L2 regularization aka gaussian priors are frequent too.
So bayes is there in practice, just not in a neat, pure form but as a collection of practical hacks.
mixture density networks are quite interesting if you want probabilistic estimates from neural networks. here, your model learns to output an array of gaussian distribution coefficients (means and variances), and mixture weights.
these weights are specific to individual observations, and trained to maximise likelihood.
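a rough sketch of the per-observation loss such a network would be trained on, for a 1-D target. it assumes (my choice, not anything from the comment above) that the network emits raw logits for the mixture weights, means, and log standard deviations, and it uses log-sum-exp for numerical stability. the network itself is omitted.

```python
import math

def mixture_nll(y, weight_logits, means, log_sigmas):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture.

    All three lists are the raw per-observation outputs of a hypothetical network.
    """
    # Log-softmax turns the logits into log mixture weights that sum to 1.
    m = max(weight_logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in weight_logits))

    log_terms = []
    for logit, mu, log_s in zip(weight_logits, means, log_sigmas):
        log_pi = logit - log_z
        z = (y - mu) / math.exp(log_s)
        log_normal = -0.5 * z * z - log_s - 0.5 * math.log(2.0 * math.pi)
        log_terms.append(log_pi + log_normal)

    # Log-sum-exp over components gives the log mixture density.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

# One observation, two components centred at -1 and +1.
print(mixture_nll(0.9, [0.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]))
```

averaging this over a batch and minimizing it with any gradient-based optimizer is the "trained to maximise likelihood" part.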
Francois Chollet's paper on measuring intelligence was really informative for me on this front; the "priors" you should have about the world are not half-cauchys over certain hyperparameters or whatever, but priors about agent-ness, object-ness, goal-oriented-ness, and so on. How to encode that in a network...well, that's the real trick, right?
I claim that knowing a priori about things like agents and objects just doesn't save you all that much data, as long as you have the imagination to consider all structures at least that complex.