Plotted results show around 50% average performance on "unseen" tasks, environments, objects, etc., which sounds a lot like success follows some kind of random distribution. That's not a great way to engender trust in the "emergent" abilities of a robotic system to generalise to unseen tasks. Blame bad statistics if you get a strawberry in the eye or a banana in the ear.
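To make the "is this just a coin flip?" point concrete: with the small trial counts typical of robot evaluations, a 50% success rate is statistically indistinguishable from chance. A minimal sketch using the Wilson score interval (the trial numbers below are hypothetical, not from the paper):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# Hypothetical evaluation: 10 successes in 20 unseen-task trials.
lo, hi = wilson_interval(10, 20)
# The interval spans roughly 0.30 to 0.70 -- far too wide to rule out
# "the robot is guessing", which is the trust problem in a nutshell.
```

With only 20 trials, you'd need a success rate well outside that band before "generalisation" beats the coin-flip hypothesis.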
Note that at the level of quantum physics one might not be able to trust CPU instructions to be faultless. It is all about getting the right error margins.
Btw, computers are symbol manipulation machines, and in general that's how we understand computation: manipulating symbols according to a set of rules, like Turing machines. Stochastic algorithms also ultimately work that way, and they will continue to until we can run all the modern AI stuff on, say, analog computers.
You have to remember that "unseen" means out-of-distribution performance, which used to be absolutely abhorrent and is now barely bearable.
Training with numbers like this might be a little problematic. I have tried to fine-tune GPT-4o-mini with very little success (just me?).
On the other hand, I found[1] that Gemini and Molmo are able to locate elements on screen much better than 4o.
Some of the authors have gone on to found a startup called Physical Intelligence: https://www.physicalintelligence.company/blog/pi0
It's absolutely humbling how far we've come in such a short period of time. "Working in meat space" has always been touted as one of the last safe refuges from ML, but it seems like it's not going to be as unapproachable as we thought.