Bayesian Neural Networks

258 points by reqo 6 days ago | 54 comments

datastoat 3 days ago |
I like Bayesian inference for few-parameter models where I have solid grounds for choosing my priors. For neural networks, I like to ask people "what's your prior for ReLU versus LeakyReLU versus sigmoid?" and I've never gotten a convincing answer.
salty_biscuits 3 days ago |
I'm sure there is a way of interpreting a relu as a sparsity prior on the layer.
pkoird 3 days ago |
Kolmogorov Arnold nets might have an answer for you!
jwuphysics 3 days ago |
Could you say a bit more about how so?
pkoird 3 days ago |
KANs have learnable activations based on splines parameterized on few variables. You can specify a prior over those variables, effectively establishing a prior over your activation function.
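To make the idea concrete, here is a minimal sketch of a prior over an activation function. It assumes a piecewise-linear spline instead of the B-splines used in the KAN paper, and an independent Gaussian prior on the spline values at each knot. The names `make_activation` and `sample_activation_from_prior` are made up for illustration and are not from any KAN library.

```python
import random
from bisect import bisect_right

def make_activation(knots, coeffs):
    """Piecewise-linear spline through the points (knots[i], coeffs[i]).

    Outside the knot range the function is clamped to the end values.
    """
    def phi(x):
        if x <= knots[0]:
            return coeffs[0]
        if x >= knots[-1]:
            return coeffs[-1]
        i = bisect_right(knots, x) - 1           # segment containing x
        t = (x - knots[i]) / (knots[i + 1] - knots[i])
        return (1 - t) * coeffs[i] + t * coeffs[i + 1]
    return phi

def sample_activation_from_prior(knots, prior_mean, prior_std, rng):
    """Draw spline values from independent Gaussians centred on prior_mean."""
    coeffs = [rng.gauss(m, prior_std) for m in prior_mean]
    return make_activation(knots, coeffs)

# Assumed prior: "the activation is probably ReLU-shaped, give or take 0.1".
knots = [-2.0, -1.0, 0.0, 1.0, 2.0]
relu_like_mean = [max(0.0, k) for k in knots]
rng = random.Random(0)
phi = sample_activation_from_prior(knots, relu_like_mean, 0.1, rng)
print([round(phi(x), 3) for x in (-1.5, 0.0, 0.5, 1.5)])
```

Centring the prior on a ReLU-like shape is only one choice; a different prior mean or a wider `prior_std` expresses a different belief about the activation.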
dccsillag 3 days ago |
Ah, Kolmogorov Arnold Networks. Perhaps the only model I have ever tried that managed to fairly often get AUCs below 0.5 in my tabular ML benchmarks. It even managed to get a frankly disturbing 0.33, where pretty much any other method (including linear regression, IIRC) would get >=0.99!
SpaceManNabs 2 days ago |
Why do you think they perform so poorly?
dccsillag 2 days ago |
Theory-wise, I'm not convinced that the models have good approximation properties (the Kolmogorov-Arnold / Kolmogorov Superposition Theorem they base themselves on has quite a bit of nuance), and the optimization problem might be a bit tricky. I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering).
In practice, I've personally run some benchmarks on a collection of datasets I had lying around. The results were generally abysmal, with the method only matching simple baselines on a few datasets.
Finally, the original paper is very weird, and reads more as a marketing piece. The theory, which is touted throughout the paper, is very weak, the actual algorithm is not sufficiently well explained there and the experiments are lacking. In particular, I find it telling that they do not include and even go out of their way to ignore important baselines such as boosted trees, which are the state-of-the-art solution to the problem that they intended to solve (and even work very well in occasions where they claim that both KANs and MLPs perform badly, e.g. in high dimensions).
SpaceManNabs 2 hours ago |
Thanks for the detailed answer. So I guess the main issue with KANs is that they don't work as well. I wonder if that shortfall could be because we haven't spent as much time setting up KANs for learning as we have for things like MLPs. I am not surprised though that KANs don't beat boosted trees and such. MLPs don't really either.
Only one follow up question:
> I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering)
Much of how we introduce inductive biases in the traditional network setting (where activations are on the node instead of on the edge, as in KANs) is by using graph-based architectures, like convolution or transformers, or by setting up particular losses and optimizations like in equivariant networks. Can't we do the same thing for KANs?
duvenaud 3 days ago |
I agree choosing priors is hard, but choosing ReLU versus LeakyReLU versus sigmoid seems like a problem with using neural nets in general, not Bayesian neural nets in particular. Am I misunderstanding?
stormfather 2 days ago |
I choose LeakyReLU vs ReLU depending on whether it's an odd day of the week, LeakyReLU being slightly favored on odd days because it's aesthetically nicer that gradients propagate through negative inputs, though I can't discern a difference. I choose sigmoid if I want to waste compute to remind myself that it converges slowly due to vanishing gradients at extreme activation levels. So it's empiricism retroactively justified by some mathematical common sense that lets me feel good about the choices. Kind of like aerodynamics.
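The gradient behaviour mentioned above can be checked numerically. This is a small sketch using the textbook derivatives of the three activations; the leaky slope of 0.01 is an assumed common default, not something from the thread.

```python
import math

def relu_grad(x):
    """Derivative of ReLU: zero for negative inputs."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    """Derivative of LeakyReLU: a small non-zero slope for negative inputs."""
    return 1.0 if x > 0 else slope

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

for x in (-10.0, -1.0, 0.5, 10.0):
    print(x, relu_grad(x), leaky_relu_grad(x), sigmoid_grad(x))
```

The sigmoid derivative peaks at 0.25 at the origin and shrinks towards zero for large positive or negative inputs, which is the vanishing-gradient effect; ReLU passes no gradient at all for negative inputs, while LeakyReLU passes a small one.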
sideshowb 3 days ago |
I like Bayes, but I thought the "surprising" result is that double descent is supposed to prevent nns from overfitting?
duvenaud 3 days ago |
Good point. We wrote this pre-double descent, and a massively overparameterized model would make a nice addition to the tutorial as a baseline. However, if you want a rich predictive distribution, it might still make sense to use a Bayesian NN.
dccsillag 3 days ago |
Bayesian Neural Networks just seem like a failed approach, unfortunately. For one, Bayesian inference and UQ fundamentally depend on the choice of the prior, but this is rarely discussed in the Bayesian NN literature and practice, and is further compounded by how fundamentally hard to interpret and choose these priors are (what is the intuition behind a NN's parameters?). Add to that the fact that the Bayesian inference is very much approximate, and you should see the trouble.
If you want UQ, 'frequentist nonparametric' approaches like Conformal Prediction and Calibration/Multi-Calibration methods seem to work quite well (especially when combined with the standard ML machinery of taking a log-likelihood as your loss), and do not suffer from any of the issues above while also giving you formal guarantees of correctness. They are a strict improvement over Bayesian NNs, IMO.
bravura 3 days ago |
Conformal learning is relatively new to me. Tell me if I'm getting any of this wrong: Conformal learning is a frequentist approach that uses a calibration set to determine how unusual a prediction is.
It seems like the main time they aren't a strict improvement over bayesian methods is when it is difficult to define your calibration set? I know this scenario isn't so commonplace, but I'm working in a scenario where I quickly looked at conformal learning and wasn't sure if it is applicable.
dccsillag 3 days ago |
That's a particular form of Conformal Prediction, called Split Conformal Prediction. Incidentally, it's also one of the best ones (i.e., most extensible, strongest guarantees, easiest to implement, remarkably sample-efficient).
Making a calibration set is pretty easy, it's just a data split (just like the train/test split). The hardest part (which is still fairly easy) is creating a 'conformity score', which is a function that receives the input and a candidate output and scores how well this candidate output 'conforms' to the input. This is where an underlying ML model can come in handy: it can, itself, estimate this! Split Conformal Prediction then does a fairly simple quantile calculation on these scores (or some variant thereof) to then form the set prediction.
In a sense, you could use Bayesian NNs to produce a conformity score. But that doesn't seem to be much better than just using e.g. the model's logits for your conformity score. Theory-wise, Conformal Prediction methods have a number of favorable guarantees that Bayesian models (and especially Bayesian NNs) generally don't, and in practice we've seen that conditional on the model giving calibrated outputs (which is guaranteed for Conformal Prediction, but not for Bayesian NNs), Conformal Prediction predicted sets seem to be tighter than the Bayesian NN ones.
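The Split Conformal Prediction recipe described above can be sketched in a few lines for regression. This is a minimal illustration, not a reference implementation: it assumes the score is the absolute residual of a point predictor (strictly a nonconformity score, where larger means worse), and the names `conformal_quantile`, `split_conformal_interval` and the toy predictor are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """k-th smallest calibration score with k = ceil((n + 1) * (1 - alpha)).

    Returns infinity when the calibration set is too small for the level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def split_conformal_interval(predict, calib_x, calib_y, alpha):
    """Wrap a fitted point predictor into an interval predictor."""
    # Score each held-out calibration point by its absolute residual.
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    return lambda x: (predict(x) - q, predict(x) + q)

# Toy usage: the "model" is y = 2x, calibration residuals are 1, 2, ..., 10.
predict = lambda x: 2.0 * x
calib_x = [float(i) for i in range(10)]
noise = [1, -2, 3, -4, 5, -6, 7, -8, 9, -10]
calib_y = [2.0 * x + e for x, e in zip(calib_x, noise)]
interval = split_conformal_interval(predict, calib_x, calib_y, alpha=0.2)
print(interval(5.0))  # (1.0, 19.0)
```

The calibration data must be held out from training, exactly like a test split. For classification the same quantile step applies, with a score built from the model's logits or probabilities.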
duvenaud 3 days ago |
I agree that Bayesian neural networks haven't been worth it in practice for many applications, but I think the main problem is that it's usually better to spend your compute training a single set of weights for a larger model, rather than doing approximate inference over weights in a smaller model. The exception is probably scientific applications where you mostly know the model, but then you don't really need a neural net anymore.
Choosing a prior is hard, but I'd say it's analogously hard to choosing an architecture - if all else fails, you can do a brute force search, and you even have the marginal likelihood to guide you. I don't think it's the main reason why people don't use BNNs much.
dkga 3 days ago |
I disagree with one conceptual point; if you are truly Bayesian you don’t “choose” a prior, by definition you “already have” a prior that you are updating with data to get to a posterior.
hgomersall 3 days ago |
At some level, you have to choose something. You can't know every level in your hierarchy.
abm53 3 days ago |
100% correct, but there are ways to push Bayesian inference back a step to justify this sort of thing.
It of course makes the problem even more complex and likely requires further approximations to computing the posterior (or even the MAP solution).
This stretches the notion that you are still doing Bayesian reasoning but can still lead to useful insights.
DiscourseFan 3 days ago |
Probably should just call it something else then; though, I gather that the simplicity of Bayes' theorem belies the complexity of what it hides.
duvenaud 2 days ago |
Sure, instead of saying "choose" a prior, you could say "elicit". But I think in this context, focusing on a practitioner's prior knowledge is missing the point. For the sorts of problems we use NNs for, we don't usually think that the guy designing the net has important knowledge that would help making good predictions. Choosing a prior is just an engineering challenge, where one has to avoid accidentally precluding plausible hypotheses.
waldrews 3 days ago |
The Conformal Prediction advocates (especially a certain prominent Twitter account) tend to rehash old frequentist-vs-bayesian arguments with more heated rhetoric than strictly necessary. That fight has been going on for almost a century now. Bayesian counterargument (in caricature form) would be that MLE frequentists just choose an arbitrary (flat) prior, and penalty hyperparameters (common in NN) are a de facto prior. The formal guarantees only have bite in the asymptotic setting or require convoluted statements about probabilities over repeated experiments; and asymptotically, the choice of prior doesn't matter anyway.
(I'm a moderate that uses both approaches, seeing them as part of a general hierarchical modeling method, which means I get mocked by either side for lack of purity).
Bayesians are losing ground at the moment because their computational methods haven't been advanced as fast by the GPU revolution for reasons having to do with difficulty in parallelization, but there's serious practical work (especially using JAX) to catch up, and the whole normalizing flow literature might just get us past the limitations of MCMC for hard problems.
But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Calibration is also a pretty magical way to improve just about any estimator. It's cheap to do and it works (although hard to guarantee anything with that in the general case...)
And don't forget quantile regression penalties! Awkward to apply in the NN setting, but an easy and effective way to do UQ in XGBoost world.
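For readers who haven't met the quantile regression penalty, it is the pinball (quantile) loss. The sketch below is plain Python, not the XGBoost API; the function names are made up for illustration. It shows the asymmetry of the loss and that the constant prediction minimizing it is an empirical quantile.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball loss for quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff

def mean_pinball_loss(ys, constant_pred, tau):
    """Average pinball loss of one constant prediction over a dataset."""
    return sum(pinball_loss(y, constant_pred, tau) for y in ys) / len(ys)

# For tau = 0.5 the best constant is the median, robust to the outlier 100.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
best = min(ys, key=lambda q: mean_pinball_loss(ys, q, 0.5))
print(best)  # 3.0
```

Fitting one model at tau = 0.05 and another at tau = 0.95 gives a rough 90% interval, though without the finite-sample coverage guarantee that a conformal wrapper adds.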
dccsillag 3 days ago |
Yeah, I know the account you are talking about, it really is a bit over the top. It's a shame, I've met a bunch of people who mentioned that they were actually turned away from Conformal Prediction due to them.
> But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Many of these things can actually work really well with Conformal Prediction, but the algorithms require extensions (much like if you are doing Bayesian inference, you also need to update your model accordingly!). They generally end up being some form of reweighting to compensate for the distribution shifts (excluding the Online Conformal Prediction literature, which is another beast entirely). Also, worth noting that if you have iid data then Conformal Prediction is remarkably data-efficient; as little as 20 samples are enough for it to start working for 95% predictive intervals, and with 50 samples (and with almost surely unique conformity scores) it's going to match 95% coverage fairly tightly.
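The sample-size figures above follow from a short calculation. Assuming exchangeable data and almost surely distinct scores, split conformal prediction with n calibration points has marginal coverage exactly k / (n + 1), where k = ceil((n + 1)(1 - alpha)), provided k <= n. The sketch below uses exact fractions to avoid floating-point rounding at the boundary; `marginal_coverage` is an illustrative name.

```python
import math
from fractions import Fraction

def marginal_coverage(n, alpha):
    """Exact marginal coverage of split conformal with n calibration points.

    Assumes exchangeable data and almost surely distinct scores.
    Returns 1 when n is too small, i.e. the prediction set is trivial.
    """
    k = math.ceil((n + 1) * (1 - alpha))
    return Fraction(1) if k > n else Fraction(k, n + 1)

alpha = Fraction(1, 20)  # 95% target
for n in (18, 19, 20, 50):
    print(n, marginal_coverage(n, alpha))
# 18 -> 1 (trivial, infinitely wide), 19 -> 19/20, 20 -> 20/21, 50 -> 49/51
```

So about 20 points is where non-trivial 95% sets become possible, and at n = 50 the marginal coverage is 49/51, roughly 96.1%. This is an average over calibration sets; coverage for one particular calibration set still varies around that value.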
3abiton 2 days ago |
Are we talking about NN Taleb? I am curious about the twitter persona.
GemesAS 2 days ago |
Someone by the name of V. Minakhin. They have an irrational hatred of Bayesian statistics. They blocked me on Twitter for pointing out that their claim that significant companies do not use Bayesian methods is contradicted by the fact that I work for one of those companies and use Bayesian methods.
travisjungroth 2 days ago |
Netflix uses Bayesian methods all over the place. In a meeting presenting new methods, I called squinting at A/B test results and considering them in the context of prior knowledge "shoot-from-the-hip cowboy Bayes". This eventually led to a Cowboy Bayes T-shirt, hat and all.
ComplexSystems 2 days ago |
"Bayesian counterargument (in caricature form) would be that MLE frequentists just choose an arbitrary (flat) prior, and penalty hyperparameters (common in NN) are a de facto prior."
This has been my view for a while now. Is this not correct?
In general, I think the idea of a big "frequentist vs Bayesian" debate is silly. I think it is very useful to take frequentist ideas and see what they look like from a Bayesian point of view, and vice versa (when applicable). I think this is pretty much the general stance among most people in the field - it's generally expected that one will understand that regularization methods equate to certain priors, for instance, and in general be able to relate these two perspectives as much as possible.
duvenaud 2 days ago |
I would argue against the idea that "MLE is just Bayes with a flat prior". The power of Bayes usually comes mainly from keeping around all the hypotheses that are compatible with the data, not from the prior. This is especially true in domains where something black-box (essentially prior-less) like a neural net has any chance of working.
fjkdlsjflkds 3 days ago |
> For one, Bayesian inference and UQ fundamentally depends on the choice of the prior, but this is rarely discussed in the Bayesian NN literature and practice, and is further compounded by how fundamentally hard to interpret and choose these priors are (what is the intuition behind a NN's parameters?).
I agree that, computationally, it is hard to justify the use of Bayesian methods on large-scale neural networks when stochastic gradient descent (and friends) is so damn efficient and effective.
On the other hand, the fact that there's a dependence on (subjective) priors is hardly a fair critique: non-Bayesian training of neural networks also depends on the use of (subjective) loss functions with (subjective) regularization terms (in fact, it can be shown that, mathematically, the use of priors is precisely equivalent to adding regularization to a loss function). Non-Bayesian training of neural networks is not "a failed approach" just because someone can arbitrarily choose L1 regularization (i.e., a Laplacian prior) over L2 regularization (i.e., a Gaussian prior).
Furthermore, we do have some intuition over NN parameters (particularly when inputs and outputs are properly scaled): a value of 10^15 should be less likely than a value of 0. Note that, in Bayesian practice, people often use weakly-informative priors (see, e.g., http://www.stat.columbia.edu/~gelman/presentations/weakprior...) to encode such intuitive statements while ensuring that (for all practical purposes) the data will effectively overwhelm the prior (again, this is equivalent to adding a minimal amount of regularization to a loss function, to make a problem well-posed when e.g. you have more parameters than data points).
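The prior/regularizer correspondence mentioned above can be checked numerically. In MAP estimation the negative log-prior is added to the negative log-likelihood, so a zero-mean Gaussian prior with standard deviation sigma contributes w^2 / (2 sigma^2) plus a constant, and a Laplace prior with scale b contributes |w| / b plus a constant. A minimal sketch for a single weight, with illustrative function names:

```python
import math

def neg_log_gaussian(w, sigma):
    """Negative log-density of a zero-mean Gaussian prior on one weight."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + w ** 2 / (2 * sigma ** 2)

def neg_log_laplace(w, b):
    """Negative log-density of a zero-mean Laplace prior on one weight."""
    return math.log(2 * b) + abs(w) / b

w, sigma, b = 3.0, 2.0, 2.0
# Dropping the constants (the value at w = 0) leaves exactly the penalty term.
l2_term = neg_log_gaussian(w, sigma) - neg_log_gaussian(0.0, sigma)
l1_term = neg_log_laplace(w, b) - neg_log_laplace(0.0, b)
print(l2_term, w ** 2 / (2 * sigma ** 2))  # both 1.125: L2 with lambda = 1 / (2 sigma^2)
print(l1_term, abs(w) / b)                 # both 1.5:   L1 with lambda = 1 / b
```

A wider prior (larger sigma or b) therefore means a weaker penalty, which is the weakly-informative case where the data dominates.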
datastoat 2 days ago |
Non-Bayesian NN training does indeed use regularizers that are chosen subjectively, but they are then tested in validation, and the best-performing regularizer is chosen. Thus the choice is empirical, not subjective.
A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more. So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.
panda-giddiness 2 days ago |
You can, in fact, do that. It's called (aptly enough) the empirical Bayes method. [1]
[1] https://en.wikipedia.org/wiki/Empirical_Bayes_method
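As a concrete instance of empirical Bayes, here is a sketch for the simplest case, a normal-normal model. The model is an assumption chosen for illustration: each observation is y_i ~ N(theta_i, noise_var) with known noise variance, and theta_i ~ N(0, tau2) with the prior variance tau2 unknown. Marginally y_i ~ N(0, noise_var + tau2), so tau2 can be estimated from the data by maximum marginal likelihood and then plugged back in as the prior.

```python
def empirical_bayes_shrinkage(ys, noise_var):
    """Estimate the prior variance from the data, then return posterior means.

    Model (assumed): y_i ~ N(theta_i, noise_var), theta_i ~ N(0, tau2).
    """
    mean_sq = sum(y * y for y in ys) / len(ys)
    # Maximum marginal likelihood estimate of tau2, clipped at zero.
    tau2 = max(0.0, mean_sq - noise_var)
    # Posterior mean of theta_i shrinks y_i towards the prior mean of 0.
    shrink = tau2 / (tau2 + noise_var)
    return tau2, [shrink * y for y in ys]

tau2, post_means = empirical_bayes_shrinkage([2.0, -2.0, 2.0, -2.0], noise_var=1.0)
print(tau2, post_means)  # 3.0 [1.5, -1.5, 1.5, -1.5]
```

The same data is used twice, once to fit the prior and once to form the posterior, which is why the resulting uncertainty statements do not carry the guarantees of a prior fixed in advance.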
datastoat 2 days ago |
Empirical Bayes is exactly what I was getting at. It's a pragmatic modelling choice, but it loses the theoretical guarantees about uncertainty quantification that pure Bayesianism gives us.
(Though if you have a reference for why empirical Bayes does give theoretical guarantees, I'll be happy to change my mind!)
fjkdlsjflkds 2 days ago |
> Non-Bayesian NN training does indeed use regularizers that are chosen subjectively, but they are then tested in validation, and the best-performing regularizer is chosen. Thus the choice is empirical, not subjective.
I'd argue the choice is still subjective, since you are still only testing over a limited (subjective) set of options. If you are doing this properly (i.e., using an independent validation set), then you can apply the same approach to a Bayesian method and obtain the same type of information ("when I use prior A vs. prior B, how does that change the generalization/out-of-bag error properties of my model?"), without violating any properties or theoretical guarantees of "Bayesianism".
> A Bayesian could try the same thing: try out several priors, and pick the one that performs best in validation. But if you pick your prior based on the data, then the classic theory about “principled quantification of uncertainty” doesn’t apply any more.
If you subjectively define a set of possible priors (i.e., distributions and parameters) to test in a validation setting, then you are not picking your prior based on the data (again, assuming that you have set up a leakage-free partition of your data in training and validation data), and you are not doing empirical Bayes, so you are not violating any supposed "principled quantification of uncertainty" (if you believe that applying a standard subjective Bayesian approach provides you with "principled quantification of uncertainty").
My point was that, in practice, there are ways of choosing (subjective) priors such that they provide sufficient regularization while ensuring that their impact on the results is minimized, particularly when you can assume certain things about the scale of data (and, in the context of neural networks, you often can, due to things like "normalization layers" and prior scaling of inputs and outputs): "subjective" doesn't have to mean "arbitrary".
> So you’re left using a computationally unwieldy procedure that doesn’t offer theoretical guarantees.
I won't argue about the fact that training NNs using Bayesian approaches is computationally unwieldy. I just don't see how evaluating a modelling decision (be it in Bayesian or non-Bayesian modelling), using a proper validation process, would violate any specific theoretical guarantees.
If you can explain to me how evaluating the generalization properties of a Bayesian training recipe on an independent dataset violates any specific theoretical guarantees, I would be thankful (note: as far as I am concerned, "principled quantification of uncertainty" is not a specific theoretical guarantee).
dkga 3 days ago |
I’m not an expert in BNNs but the prior does not need to be justified in terms of each parameter. Bayesian analysis frequently uses hyperparameters to set the overall tightness or looseness of the parameters (a la Minnesota priors in the econometric literature for example). This would be a similar regularisation intuition as, eg, L1 and L2 regularisation in traditional NN training. This is of course just one example.
nvrmnd 3 days ago |
What is 'UQ', I assume some measure of uncertainty over your model outputs?
rscho 3 days ago |
Unbiased quantifier
proto-n 3 days ago |
Usually means uncertainty quantification
scellus 2 days ago |
Priors on parameters are not an issue. For models at scale, priors are just some computationally convenient shrinkage, and what works is found empirically and canonized into practice; projecting prior knowledge of the problem at hand through parameter priors does not really happen except in some vague sense ("I think most predictors are irrelevant, so make it sparse by Cauchy/horseshoe/whatever").
The important thing in bayesian (statistical, ML) modelling in general is the ability to gain in flexibility and do model structures that otherwise would be hard or impossible: latent states, hierarchies, etc.
In bayesian NNs the main advantages would be around uncertainty quantification (UQ) and in finding good optima and partly to avoid overfitting. These do apply in some cases of simple NNs.
Mostly however, especially with larger conventional models (not speaking of normalizing flows and such here), using explicit bayes is not feasible. Instead, people use approximate point estimates with tricks:
(1) UQ has been taken care of by post-calibration. (2) Stochastic gradient actually searches for large posterior masses like a variational approximation would do, so it is kind of bayes. (3) And those priors: using dropout is commonplace, it has a bayesian interpretation, and L2 regularization aka gaussian priors are frequent too.
So bayes is there in practice, just not in a neat, pure form but as a collection of practical hacks.
duvenaud 3 days ago |
Author here! What a surprise. This was an abandoned project from 2019, that we never linked or advertised anywhere as far as I know. Anyways, happy to answer questions.
esafak 3 days ago |
just a little typo, but it's Kullback-Leibler.
duvenaud 2 days ago |
Thanks for pointing that out!
idontknowmuch 3 days ago |
Somewhat related — I’d love to hear your thoughts on dex-Lang and Haskell for array programming?
duvenaud 2 days ago |
I still am excited by Dex (https://github.com/google-research/dex-lang/) and still write code in it! I have a bunch of demos and fixes written, and am just waiting for Dougal to finish his latest re-write before I can merge them.
mugivarra69 3 days ago |
why (if) was this not picked up for further research? i know that oatml did quite a lot of work on this front as well and it seems the direction is still being worked on. want to get ur 2 cents on this approach.
duvenaud 2 days ago |
BNNs certainly have their uses, but I think people in general found that it's a better use of compute to fit a larger model on more data than to try to squeeze more juice from a given small dataset + model. Usually there is more data available, it's just somewhat tangentially related. LLMs are the ultimate example of how training on tons of tangentially-related data can ultimately be worthwhile for almost any task.
timeinput 2 days ago |
What did you use to produce the article? I really really like the formatting.
duvenaud 2 days ago |
I think we used a distill.pub template. Also Jerry wrote some custom BNN fitting code in javascript. I'll ask my co-authors to open-source it.
duvenaud 2 days ago |
Update: the code is here:
https://github.com/jerryqhyu/distill_bayes_net
oli5679 2 days ago |
https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004...
mixture density networks are quite interesting if you want probabilistic estimates from a neural network. here, your model learns to output an array of gaussian distribution parameters (means and variances) and mixture weights.
these weights are specific to individual observations, and trained to maximise likelihood.
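The training objective for such a network is the negative log-likelihood of the target under the predicted Gaussian mixture. The sketch below computes it for one observation with a scalar target. It assumes the network emits unnormalized mixture logits, component means and log standard deviations; that parameterization and the function names are illustrative choices, not taken from the linked paper.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mixture_nll(y, logits, means, log_stds):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture."""
    log_norm = log_sum_exp(logits)  # softmax normalizer for the mixture weights
    log_terms = []
    for logit, mu, log_s in zip(logits, means, log_stds):
        log_weight = logit - log_norm
        z = (y - mu) / math.exp(log_s)
        log_density = -0.5 * z * z - log_s - 0.5 * math.log(2 * math.pi)
        log_terms.append(log_weight + log_density)
    return -log_sum_exp(log_terms)

# Two-component example: a target near the second mode is reasonably likely.
print(mixture_nll(2.9, logits=[0.0, 0.0], means=[-3.0, 3.0], log_stds=[0.0, 0.0]))
```

In a real network the three parameter lists would be outputs of the final layer for each input, and this loss would be averaged over the batch and minimized by gradient descent.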
duvenaud 2 days ago |
This approach characterizes a different type of uncertainty than BNNs do, and the approaches can be combined. The BNN tracks uncertainty about parameters in the NN, and mixture density nets track the noise distribution _conditional on knowing the parameters_.
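One way to see how the two kinds of uncertainty combine is the law of total variance. Suppose, as an assumption for this sketch, that several parameter samples (for example posterior draws from a BNN) each predict a mean and a noise variance for the same input. The average predicted variance is the noise part, and the spread of the predicted means is the parameter part.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(member_means, member_vars):
    """Split predictive variance into noise and parameter components.

    member_means[i], member_vars[i]: predictive mean and noise variance
    from the i-th parameter sample, all for the same input.
    """
    aleatoric = fmean(member_vars)        # noise, given the parameters
    epistemic = pvariance(member_means)   # disagreement between parameter samples
    return aleatoric, epistemic, aleatoric + epistemic

print(decompose_uncertainty([1.0, 3.0], [0.5, 1.5]))  # (1.0, 1.0, 2.0)
```

More data should shrink the second term, since the parameter samples come to agree, while the first term stays as long as the targets themselves are noisy.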
ok123456 2 days ago |
BNNs were an attractive choice in scenarios where the data is expensive to collect, like actual physical experiments. But boosting and other tree-based regression methods give you similar performance with a more straightforward framework for limited tabular data.
levocardia 2 days ago |
What frustrates me about Bayesian NNs is that talking about "priors" doesn't make nearly as much sense as it does in a regression context. A prior over parameter weights has no interpretation in the way that a prior over a regression coefficient, or even a spline smoothness, does. What you really want -- and what natural intelligence probably has -- are priors over aspects of the world.
Francois Chollet's paper on measuring intelligence was really informative for me on this front; the "priors" you should have about the world are not half-cauchys over certain hyperparameters or whatever, but priors about agent-ness, object-ness, goal-oriented-ness, and so on. How to encode that in a network...well, that's the real trick, right?
duvenaud 2 days ago |
I agree that priors over aspects of the world would be more useful, but I don't think that they're important in making natural intelligence powerful. In my experience, the important thing is to make your prior really broad, but containing all kinds of different hypotheses with different kinds of rich structure.
I claim that knowing a priori about things like agents and objects just doesn't save you all that much data, as long as you have the imagination to consider all structures at least that complex.