Show HN: TabPFN v2 – A SOTA foundation model for small tabular data

99 points by onasta 16 hours ago | 19 comments

I am excited to announce the release of TabPFN v2, a tabular foundation model that delivers state-of-the-art predictions on small datasets in just 2.8 seconds for classification and 4.8 seconds for regression, compared with strong baselines that were tuned for 4 hours. Published in Nature, this model outperforms traditional methods on datasets with up to 10,000 samples and 500 features.

The model is available under an open license: a derivative of the Apache 2 license with a single modification, adding an enhanced attribution requirement inspired by the Llama 3 license: https://github.com/PriorLabs/tabpfn. You can also try it via API: https://github.com/PriorLabs/tabpfn-client

TabPFN v2 is trained on 130 million synthetic tabular prediction datasets to perform in-context learning and output a predictive distribution for the test data points. Each dataset acts as one meta-datapoint to train the TabPFN weights with SGD. As a foundation model, TabPFN allows for fine-tuning, density estimation and data generation.
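
To make the "one dataset = one meta-datapoint" idea concrete, here is a minimal sketch in plain Python. The random linear-threshold prior, the function names, and the context/query split sizes are illustrative assumptions; this is not the actual TabPFN prior or training code.

```python
import random

def sample_synthetic_dataset(n_rows=20, n_features=3, rng=None):
    """Draw one toy dataset from a hypothetical prior (a random linear rule)."""
    rng = rng or random.Random()
    weights = [rng.gauss(0, 1) for _ in range(n_features)]
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    # Label is 1 when the weighted sum of the features is positive.
    y = [int(sum(w * v for w, v in zip(weights, row)) > 0) for row in X]
    return X, y

def make_meta_datapoint(X, y, n_context):
    """Split one dataset into labelled context rows and held-out query rows."""
    context = (X[:n_context], y[:n_context])
    query = (X[n_context:], y[n_context:])
    return context, query

rng = random.Random(0)
# Each element of the batch is one whole dataset, i.e. one meta-datapoint.
meta_batch = [
    make_meta_datapoint(*sample_synthetic_dataset(rng=rng), n_context=15)
    for _ in range(4)
]
```

In such a setup, a model would see the context rows with labels plus the query features, and a gradient step would push its predicted distribution toward the held-out query labels.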

Compared to TabPFN v1, v2 now natively supports categorical features and missing values. TabPFN v2 performs just as well on datasets with or without these. It also handles outliers and uninformative features naturally, problems that often throw off standard neural nets.

TabPFN v2 performs as well with half the data as the next best baseline (CatBoost) with all the data.

We also compared TabPFN to the SOTA AutoML system AutoGluon 1.0. Standard TabPFN already outperforms AutoGluon on classification and ties on regression, but ensembling multiple TabPFNs in TabPFN v2 (PHE) is even better.

There are some limitations: TabPFN v2 is very fast to train and does not require hyperparameter tuning, but inference is slow. The model is also only designed for datasets up to 10k data points and 500 features. While it may perform well on larger datasets, it hasn't been our focus.

We're actively working on removing these limitations and intend to release new versions of TabPFN that can handle larger datasets, have faster inference and perform in additional predictive settings such as time-series and recommender systems.

We would love for you to try out TabPFN v2 and give us your feedback!

OutOfHere 16 hours ago |
Related repo: https://github.com/liam-sbhoo/tabpfn-time-series
instanceofme 16 hours ago |
Related: CARTE-AI, which can also deal with multiple tables.
https://soda-inria.github.io/carte/ https://arxiv.org/pdf/2402.16785
The paper includes a comparison to TabPFN v1 (among others), noting the lack of categorical & missing values handling which v2 now seems to have. Would be curious to see an updated comparison.
onasta 10 hours ago |
TabPFN has been better on numerical data since v1 (see figure 6 in the CARTE paper). CARTE's main strength is on text features, which are now also supported in the TabPFN v2 API version (https://github.com/PriorLabs/tabpfn-client). We compared this to CARTE and found our model to be generally better, and much faster. CARTE's multi-table approach is also very interesting, and we want to tackle this setting in the future.
ggnore7452 11 hours ago |
anyone tried this? is this actually overall better than xgboost/catboost?
westurner 10 hours ago |
Benchmark of tabpfn<2 compared to xgboost, lightgbm, and catboost: https://x.com/FrankRHutter/status/1583410845307977733 .. https://news.ycombinator.com/item?id=33486914
enigmaa99 9 hours ago |
Yes, it actually is, but the limits on rows and features could be a hindrance.
bbstats 11 hours ago |
looks amazing - finally, DL that beats a tuned catboost?
_giorgio_ 10 hours ago |
It's probably the same model with the same limitations, released nearly two years ago?
https://arxiv.org/abs/2207.01848
ersiees 10 hours ago |
No, it is *much* stronger, a different architecture and scales to 10x the number of examples. It can also do regression now, and handle categorical features. Please, have a quick look at the abstract before making such claims.
onasta 10 hours ago |
There have been a ton of improvements! Much better performance overall, way larger data size limit (1K-->10K rows, 100-->500 features), regression support, native categorical data and missing values handling, much better support for uninformative or outlier features etc.
enigmaa99 9 hours ago |
I tried this on a few CARTE datasets and it works surprisingly better!! Woahhh
peepeepoopoo99 8 hours ago |
How can you train a tabular foundation model when the tabular features themselves are inherently domain-specific? Is there some kind of preprocessing step beforehand to match the inference time features with their closest analogues in the training set?
hooloovoo_zoo 7 hours ago |
Were your benchmark methods tuned per dataset or across datasets?
ersiees 19 minutes ago |
Tuned per dataset
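For readers unsure of the distinction, a minimal sketch of the two tuning protocols follows. The config names, dataset names, and scores are made up for illustration and have nothing to do with the paper's benchmark.

```python
# Hypothetical validation scores, indexed as scores[config][dataset].
scores = {
    "shallow": {"A": 0.80, "B": 0.90, "C": 0.70},
    "deep": {"A": 0.85, "B": 0.82, "C": 0.78},
}
datasets = ["A", "B", "C"]

def tune_per_dataset(scores, datasets):
    """Pick the best config separately for every dataset."""
    return {d: max(scores, key=lambda c: scores[c][d]) for d in datasets}

def tune_across_datasets(scores, datasets):
    """Pick a single config with the best total score over all datasets."""
    return max(scores, key=lambda c: sum(scores[c][d] for d in datasets))

per_dataset = tune_per_dataset(scores, datasets)
shared = tune_across_datasets(scores, datasets)
```

Per-dataset tuning gives each baseline its best configuration on every dataset, which makes it the stronger (and more expensive) comparison.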
gcr 6 hours ago |
Thanks for such a cool project! It's immediately apparent how to use it and I appreciate the brief examples.
Quick question: In the breast cancer example from the README, simple support vector machines from sklearn (the first thing i tried to compare baseline performance, incidentally) seem to outperform TabPFN. Is this expected? I know it's a baseline to demonstrate ease of use rather than SOTA performance, but I am curious.
  # (TabPFN)
  In [13]: print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))
  ROC AUC: 0.996299494264216
  # (LinearSVC)
  In [27]: from sklearn.svm import LinearSVC
  In [28]: clf=LinearSVC(C=0.01).fit(X_train, y_train)
  In [29]: roc_auc_score(y_test, clf.decision_function(X_test))
  Out[29]: 0.997532996176144
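For context on what the two numbers measure: ROC AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score, so on a small test set a gap in the third decimal place may correspond to only a few reordered pairs. A minimal pure-Python sketch, where the function name `pairwise_auc` and the toy labels and scores are illustrative and not taken from the README:

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: one of the four positive/negative pairs is ordered wrongly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
```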
nickpsecurity 5 hours ago |
A while back, I was looking for a project amateurs could do for experimenting with Transformer alternatives and optimization algorithms. My concept was grabbing objective test functions from the literature, making custom ones based on realistic data, and layering them together based on real-world depth. Then, training various approaches on them using consumer GPUs or spot instances of high-end GPUs.
What I read in this paper blew that idea out of the water! I mean, it’s still doable but you’ve far exceeded it.
I love that you covered many types of structures, used 8x consumer GPUs, more like OSS folks do (widely-accessible pretraining), claim no copyright infringement for pretraining, and use enough techniques in ML that people can enjoy Googling stuff for days.
I do have some questions about what I might have overlooked in the paper.
1. Are the training data and code available to reproduce the model and iteratively improve its architectural decisions?
2. Most authors claiming their data was legal or open were actually committing copyright infringement. Your method might dodge that if users generate their own synthetic data using methods they can verify aren’t themselves encumbered. Is that code available under open licensing? If not, would you offer it for a fee for companies or free for researchers?
3. What specific, common uses could amateurs try that would display the model’s ability in a business setting? (Both to drive more research or build products on the model.)
I thank you for your time.
tmostak 4 hours ago |
This looks amazing!
Just looking through the code a bit, it seems that the model supports a (custom) attention mechanism both between features and between rows (the code uses the term items)? If so, does the attention between rows help improve accuracy significantly?
Generally, for standard regression and classification use cases, rows (observations) are seen to be independent, but I'm guessing cross-row attention might help the model see the gestalt of the data in some way that improves accuracy even when the independence assumption holds?
dist-epoch 6 minutes ago |
Speculating, cross-row attention might give you information about where a row sits in the distribution of the other rows.
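As a toy illustration of that speculation (not a description of how TabPFN's attention works), here is the kind of signal that is only available when a row can be compared with the others: the empirical quantile of a value within its column. The function name and numbers are made up.

```python
from bisect import bisect_right

def empirical_quantile(value, column_values):
    """Fraction of the column's values that are <= the given value."""
    ordered = sorted(column_values)
    return bisect_right(ordered, value) / len(ordered)

# Values of one feature across the other rows (hypothetical numbers).
train_column = [3.0, 7.0, 1.0, 9.0, 5.0]
q = empirical_quantile(6.0, train_column)  # three of the five values are <= 6.0
```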
patcon 41 minutes ago |
Neat! Might this even be useful for imputing missing data in a sparse network of votes, for a system like this (pol.is) whose goal is to do dimensionality reduction and visualise the opinion space of divisive social topics: https://gwern.net/doc/sociology/2021-small.pdf
200 voters on 50 statements would fall within the 10,000 sample threshold. This is well within the bounds of some existing conversations with open data, so it could be tested... Potential values on each statement are agree/disagree/pass (+1/-1/0)
https://github.com/compdemocracy/openData/blob/master/brexit...
https://github.com/compdemocracy/openData/blob/master/brexit...
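A minimal sketch of how the imputation could be framed, assuming one supervised task per statement: voters who voted on the target statement become training rows, and voters who did not become test rows. With 200 voters and 50 statements, each task has at most 200 rows and 49 feature columns. The toy matrix and the function name `imputation_task` are hypothetical, and the remaining `None` features assume a model that handles missing values natively.

```python
# Toy vote matrix: rows are voters, columns are statements.
# +1 = agree, -1 = disagree, 0 = pass, None = no vote recorded.
votes = [
    [1, -1, None],
    [1, None, 0],
    [-1, -1, 1],
    [None, 1, 1],
]

def imputation_task(votes, target_col):
    """Frame imputing one statement's votes as a supervised tabular task."""
    X_train, y_train, X_test, test_rows = [], [], [], []
    for i, row in enumerate(votes):
        # Features are this voter's votes on every other statement.
        features = row[:target_col] + row[target_col + 1:]
        if row[target_col] is None:
            X_test.append(features)   # vote to be imputed
            test_rows.append(i)
        else:
            X_train.append(features)  # observed vote becomes the label
            y_train.append(row[target_col])
    return X_train, y_train, X_test, test_rows

X_train, y_train, X_test, test_rows = imputation_task(votes, target_col=0)
```

Repeating this for every statement would fill in the whole matrix, one column at a time.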