Plotted results show around 50% average performance on "unseen" tasks, environments, objects, etc., which sounds a lot like success follows some kind of random distribution. That's not a great way to engender trust in the "emergent" abilities of a robotic system to generalise to unseen tasks. Blame bad statistics if you get a strawberry in the eye or a banana in the ear.
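To make the "is this just a coin flip?" point concrete: with the small trial counts typical of robot evaluations, a 50% success rate is statistically indistinguishable from chance. A minimal sketch using the Wilson score interval (the trial numbers below are hypothetical, not from the paper):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# Hypothetical evaluation: 10 successes in 20 unseen-task trials.
lo, hi = wilson_interval(10, 20)
# The interval spans roughly 0.30 to 0.70 -- far too wide to rule out
# "the robot is guessing", which is the trust problem in a nutshell.
```

With only 20 trials, you'd need a success rate well outside that band before "generalisation" beats the coin-flip hypothesis.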
Note that at the level of quantum physics one might not be able to trust CPU instructions to be faultless. It is all about getting the right error margins.
Btw, computers are symbol manipulation machines, and in general that's how we understand computation: manipulating symbols according to a set of rules, like Turing machines. Stochastic algorithms also ultimately work that way, and they will continue to until we can run all the modern AI stuff on, say, analog computers.
You have to remember that "unseen" means out-of-distribution performance, which used to be absolutely abhorrent and is now barely bearable.
Training with numbers like this might be a little problematic. I have tried to fine-tune GPT-4o-mini with very little success (just me?).
On the other hand, I found[1] that Gemini and Molmo are able to locate elements on screen much better than 4o.
Some of the authors have gone on to found a startup called Physical Intelligence: https://www.physicalintelligence.company/blog/pi0
It's absolutely humbling how far we've come in such a short period of time. "Working in meat space" has always been touted as one of the last safe refuges from ML, but it seems like it's not going to be as unapproachable as we thought.