Common misconceptions about the complexity in robotics vs. AI (2024)

140 points by wallflower 4 days ago | 78 comments

jvanderbot a day ago |
> Moravec’s paradox is the observation by artificial intelligence and robotics researchers that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky, and others in the 1980s.
I have a name for it now!
I've said over and over that there are only two really hard problems in robotics: Perception and funding. A perfectly perceived system and world can be trivially planned for and (at least proprio-)controlled. Imagine having a perfect intuition about other actors such that you know their paths (in self driving cars), or your map is a perfect voxel + trajectory + classification. How divine!
It's limited information and difficulties in reducing signal to concise representation that always get ya. This is why the perfect lab demos always fail - there's a corner case not in your training data, or the sensor stuttered or became misaligned, or etc etc.
jvanderbot a day ago |
> Moravec hypothesized around his paradox, that the reason for the paradox [that things we perceive as easy b/c we dont think about them are actually hard] could be due to the sensor & motor portion of the human brain having had billions of years of experience and natural selection to fine-tune it, while abstract thoughts have had maybe 100 thousand years or less
Another gem!
Legend2440 a day ago |
Or it could be a parallel vs serial compute thing.
Perception tasks involve relatively simple operations across very large amounts of data, which is very easy if you have a lot of parallel processors.
Abstract thought is mostly a serial task, applying very complex operations to a small amount of data. Many abstract tasks like evaluating logical expressions cannot be done in parallel - they are in the complexity class P-complete.
Your brain is mostly a parallel processor (80 billion neurons operating asynchronously), so logical reasoning is hard and perception is easy. Your CPU is mostly a serial processor, so logical reasoning is easy and perception is hard.
cratermoon 21 hours ago |
> Perception tasks involve relatively simple operations across very large amounts of data, which is very easy if you have a lot of parallel processors.
Yes, relatively simple. Wait, isn't that exactly what the article explained was completely wrong-headed?
burnished 17 hours ago |
No. The article is talking about things we think of as being easy because they are easy for a human to perform but that are actually very difficult to formalize/reproduce artificially.
The person you are responding to is instead comparing differences in biological systems and mechanical systems.
visarga 10 hours ago |
> Or it could be a parallel vs serial compute thing.
The brain itself is both a parallel system an a serially constrained system. It has distributed activity but it must resolve in a serial chain of action. We can't walk left and right at the same time. Any goals forces us to follow specific steps in specific order. This conflict between parallel processing and serial outputs is where the magic happens.
topherclay 18 hours ago |
> ...the sensor & motor portion of the human brain having had billions of years of experience.
It doesn't really change the significance of the quote, but I can't help but point out that we didn't even have nerve cells more than 0.6 billion of years ago.
lang4d 21 hours ago |
Maybe just semantics, but I think I would call that prediction. Even if you have perfect perception (measuring the current state of the world perfectly), it's nontrivial to predict the future paths of other actors. The prediction problem requires intuition about what the other actors are thinking, how their plans influence each other, and how your plan influences them.
bobsomers 21 hours ago |
> I've said over and over that there are only two really hard problems in robotics: Perception and funding. A perfectly perceived system and world can be trivially planned for and (at least proprio-)controlled.
Funding for sure. :)
But as for perception, the inverse is also true. If I have an perfect planning/prediction system, I can throw the grungiest, worst perception data into it and it will still plan successfully despite tons of uncertainty.
And therein lies the real challenge of robotics: It's fundamentally a systems engineering problem. You will never have perfect perception or a perfect planner. So, can you make a perception system that is good enough that, when coupled with your planning system which is good enough, you are able to solve enough problems with enough 9s to make it successful.
The most commercially successful robots I've seen have had some of the smartest systems engineering behind them, such that entire classes of failures were eliminated by being smarter about what you actually need to do to solve the problem and aggressively avoid solving subproblems that aren't absolutely necessary. Only then do you really have a hope of getting good enough at that focused domain to ship something before the money runs out. :)
portaouflop 20 hours ago |
> being smarter about what you actually need to do to solve the problem and aggressively avoid solving subproblems that aren't absolutely necessary
I feel like this is true for every engineering discipline or maybe even every field that needs to operate in the real world
vrighter 18 hours ago |
except software, of course. Nowadays it seems that software is all about creating problems to create solutions for.
krisoft 15 hours ago |
> If I have an perfect planning/prediction system, I can throw the grungiest, worst perception data into it and it will still plan successfully despite tons of uncertainty.
Not really. Even the perfect planning system will appear eratic in the presence of perception noise. It must be because it can’t create information out of nowhere.
I have seen robots eratically stop because they thought that the traffic in the oncomming lane is enroaching on theirs. You can’t make the planning system ignore that because then sometimes it will collide with people playing chicken with you.
Likewise I have seen robots eratically stop because they thought that a lamp post was slowly reversing out in front of them. All due to perception noise (in this case both location noise, and misclassification.)
And do note that these are just the false positives. If you have a bad perception system you can also suffer from false negatives. Just experiment biases hide those.
So your “perfect planning/prediction” will appear overly cautious while at the same time will be sometimes reckless. Because it doesn’t have the information to not to. You can’t magic plan your way out of that. (Unless you pipe the raw sensor data into the planner, in which case you created a second perception system you are just not calling it perception.)
YeGoblynQueenne 10 hours ago |
>> (Unless you pipe the raw sensor data into the planner, in which case you created a second perception system you are just not calling it perception.)
Like with model-free RL learning a model from pixels?
jvanderbot 11 hours ago |
A "perfect" planning system which can handle arbitrarily bad perception is indistinguishable from a perception system.
I've not seen a system that claimed to be robust to sensor noise that didn't do some filtering, estimation, or state representation internally. Those are just sensor systems inside the box.
exe34 19 hours ago |
"the sensor stuttered or became misaligned, or etc etc."
if your eyes suddenly crossed, you'd probably fall over too!
seanhunter 18 hours ago |
Yeah the fun way Moravec's paradox was explained to me [1] is that you can now easily get a computer to solve simultaneous differential equations governing all the axes of motion of a robot arm but getting it to pick one screw out of a box of screws is an unsolved research problem.
[1] by a disillusioned computer vision phd that left the field in the 1990s.
wrp 17 hours ago |
Selective attention was one of the main factors in Hubert Dreyfus' explanation of "what computers can't do." He had a special term for it, which I can't remember off-hand.
visarga 10 hours ago |
> A perfectly perceived system and world can be trivially planned for
I think it's not about perfect perception, there is no such thing not even in humans, it's about adaptability, recovery from error, resilience, and mostly about learning from the outside when the process fails to work. Each problem has its own problem space to explore. I think of intelligence as search efficiency across many problem spaces, there is no perfection in it. Our problem spaces are far from exhaustively known.
catgary a day ago |
Yeah, this was my general impression after a brief, disastrous stretch in robotics after my PhD. Hell, I work in animation now, which is a way easier problem since there are no physical constraints, and we still can’t solve a lot of the problems the OP brings up.
Even stuff like using video misses the point, because so much of our experience is via touch.
johnwalkr 12 hours ago |
I've worked in a robotics-adjacent field for 15 years and robotics is truly hard. The number of people and companies I've seen come and go that claim their software expertise will make a useful, profitable robot is.. a lot.
Legend2440 a day ago |
Honestly I'm tired of people who are more focused on 'debunking the hype' than figuring out how to make things work.
Yes, robotics is hard, and it's still hard despite big breakthroughs in other parts of AI like computer vision and NLP. But deep learning is still the most promising avenue for general-purpose robots, and it's hard to imagine a way to handle the open-ended complexity of the real world other than learning.
Just let them cook.
mitthrowaway2 21 hours ago |
> If you want a more technical, serious (better) post with a solution oriented point to make, I’ll refer you to Eric Jang’s post [1]
[1] https://evjang.com/2022/07/23/robotics-generative.html
FloorEgg 19 hours ago |
As someone on the sidelines of robotics who generally feels everything getting disrupted and at the precipice of major change, it's really helpful to have a clearer understanding of the actual challenge and how close we are to solving it. Anything that helps me make more accurate predictions will help me make better decisions about what problems I should be trying to solve and what skills I should be trying to develop.
cratermoon a day ago |
It might be nice if the author qualified "most of the freely available data on the internet" with "whether or not it was copyrighted" or something to acknowledge the widespread theft of the works of millions.
danielbln 19 hours ago |
Theft is the wrong term, it implies that the original is no longer available. It's copyright infringement at best, and possibly fair use depending on jurisdiction. It wasn't theft when the RIAA went on a lawsuit spree against mp3 copying, and it isn't theft now.
CaptainFever 17 hours ago |
Related: https://www.youtube.com/watch?v=IeTybKL1pM4
cratermoon 11 hours ago |
Ackchyually....
jes5199 21 hours ago |
I would love to see some numbers. How many orders of magnitude more complicated do we think embodiment is, compared to conversation? How much data do we need compared to what we’ve already collected?
FloorEgg 19 hours ago |
If nature computed both through evolution, then maybe it's approximately the same ratio. So roughly the time it took to evolve embodiment, and roughly the time it took to evolve from grunts to advanced language.
If we start from when we think multicellular life first evolved (~2b years), or maybe the Cambrian explosion (~500m years), and until modern humans (~300k years). Then compare that to the time between first modern humans now now.
It seems like maybe 3-4 orders of magnitude harder.
My intuition after reading the articles is that there needs to be way more sensors all throughout the robot, probably with lots of redundancies, and then lots of modern LLM sized models all dedicated to specific joints and functions and capable of cascading judgement between each other, similar to how our nervous system works.
jes5199 8 hours ago |
so like ten to twenty years, via moore’s law?
daveguy 5 hours ago |
Maybe. If Moore's law remotely holds up for ten to twenty years. There's still the part about not having a clue how to replicate physical systems efficiently vs logical systems.
rstuart4133 17 hours ago |
"Hardness" is a difficult quantity to define if you venture beyond "humans have been trying to build systems to do this for a while, and haven't succeeded".
Insects have succeed in build precision systems that combine vision, smell, touch and a few other senses. I doubt finding a juicy spider, immobilising it, is that much more difficult that finding a door knob and turning it, or folding a T-Shirt. Yet insects accomplish it with I suspect far less compute than modern LLM's. So it's not "hard" in the sense of requiring huge compute resources, and certainly not a lot of power.
So it's probably not that hard in the sense that it's well within the capabilities of the hardware we have now. The issue is more that we don't have a clue how to do it.
jes5199 8 hours ago |
well the magic of transformer architecture is that if the rules exist and are computationally tractable, the system will find them in the data, and we don’t have to have a clue. so. how much data do we need?
BlueTemplar 7 hours ago |
Calling it "compute" might be part of the issue : insects aren't (even partially) digital computers.
We might or might not be able to emulate what they process on digital computers, but emulation implies a performance loss.
And this doesn't even cover inputs/outputs (some of which might be already good enough for some tasks, like the article's example of remotely operated machines).
rstuart4133 2 hours ago |
> Calling it "compute" might be part of the issue : insects aren't (even partially) digital computers.
I have trouble with that. I date from the era when analogue computers were a thing. They didn't have a hope of keeping up with digital 40 years ago when clock speeds were measured in the KHz, and a flip flop took multiple mm². Now they are digital computersliterally 10's of thousands times faster and billions of times smaller.
The key weakness of analogue isn't speed, power consumption, or size. They excel in all those areas. Their problem is the signal degrades at each step. You can only chain a few steps together before it all turns to mush. Digital can chain an unlimited number of steps of course. Because it's unlimited can emulate any analogue system with reasonable fidelity. We can emulate the weather for a few days out, and it is one of the most insanely complex analogue systems you are likely to come across.
Emulating analogue systems using lots of digital steps costs you size and power of course. In a robot we don't have unlimited amounts of either. However right now if someone pulled off the things he is talking about while hooked up to an entire data centre we'd be ecstatic. That means can't even solve the problem given unlimited power and space. We truely don't have a clue. (To be fair this isn't true any more if you consider Waymo to be a working example. But it's just one system, and we haven't figured out how to generalise it yet.)
By the way, this "analogue losses fidelity" problem applies to all systems, even insects. The solution is always the same: convert it to digital. And it has to happen very early. Our brains are only 10 neurons deep as I understand it. They are digital. 10 steps is far too much for analogue. It's likely the very first process steps in all our senses such as eyesight are analogue. But before the information leaves the eyeball it's already been converted to digital pulses running down the optic nerve. It's the same story everywhere. This is true for our current computer systems too of course. Underneath, MLC flash uses muplitple voltages, QAM is a encoding of multiple bits in a sine wave, a pixel in a camera is the output from multiple sensors. We do some very simply analogue manipulation on it like amplification, then convert it to digital before it turns to mush.
timomaxgalvin 17 hours ago |
I feel more tired after driving all day than reading all day.
jes5199 8 hours ago |
man I don’t. I can drive for 12+ hours. I can be on the internet for like 6
daveguy 5 hours ago |
Modern ADAS probably makes the driving much easier. What about reading print? Just as long? (Wondering the screen fatigue aspect vs just language processing)
no_op 21 hours ago |
I think Moravec's Paradox is often misapplied when considering LLMs vs. robotics. It's true that formal reasoning over unambiguous problem representations is easy and computationally cheap. Lisp machines were already doing this sort of thing in the '70s. But the kind of commonsense reasoning over ambiguous natural language that LLMs can do is not easy or computationally cheap. Many early AI researchers thought it would be — that it would just require a bit of elaboration on the formal reasoning stuff — but this was totally wrong.
So, it doesn't make sense to say that what LLMs do is Moravec-easy, and therefore can't be extrapolated to predict near-term progress on Moravec-hard problems like robotics. What LLMs do is, in fact, Moravec-hard. And we should expect that if we've got enough compute to make major progress on one Moravec-hard problem, there's a good chance we're closing in on having enough to make major progress on others.
bjornsing 17 hours ago |
Good points. Came here to say pretty much the same.
Moravec's Paradox is certainly interesting and correct if you limit its scope (as you say). But it feels intuitively wrong to me to make any claims about the relative computational demands of sensi-motor control and abstract thinking before we’ve really solved either problem.
Looking e.g. at the recent progress in solving ARC-AGI my impression is that abstract thought could have incredible computational demands. IIRC they had to throw approximately $10k of compute at o3 before it reached human performance. Now compare how cognitively challenging ARC-AGI is to e.g. designing or reorganizing a Tesla gigafactory.
With that said I do agree that our culture tends to value simple office work over skillful practical work. Hopefully the progress in AI/ML will soon correct that wrong.
RaftPeople 7 hours ago |
Also agree and also came here to say the same.
lsy 17 hours ago |
Leaving aside the lack of consensus around whether LLMs actually succeed in commonsense reasoning, this seems a little bit like saying “Actually, the first 90% of our project took an enormous amount of time, so it must be ‘Pareto-hard’. And thus the last 10% is well within reach!” That is, that Pareto and Moravec were in fact just wrong, and thing A and thing B are equivalently hard.
Keeping the paradox would more logically bring you to the conclusion that LLMs’ massive computational needs and limited capacities imply a commensurately greater, mind-bogglingly large computational requirement for physical aptitude.
nopinsight 16 hours ago |
It's far from obvious that thought space is much less complex than physical space. Natural language covers emotional, psychological, social, and abstract concepts that are orthogonal to physical aptitude.
While the linguistic representation of thought space may be discrete and appear simpler (even the latter is arguable), the underlying phenomena are not.
Current LLMs are terrific in many ways but pale in comparison to great authors in capturing deep, nuanced human experience.
As a related point, for AI to truly understand humans, it will likely need to process videos, social interactions, and other forms of data beyond language alone.
visarga 11 hours ago |
I think the essence of human creativity is outside our brains - in our environments, our search spaces, our interactions. We stumble upon discoveries or patterns, we ideate and test, and most ideas fail but a few remain. And we call it creativity, but it's just environment tested ideation.
If you put an AI like AlphaZero in a Go environment it explores so much of the game space that it invents its own Go culture from scratch and beats us at our own game. Creativity is search in disguise, having good feedback is essential.
AI will become more and more grounded as it interacts with the real world, as opposed to simply modeling organic text as GPT-3. More recent models generate lots of synthetic data to simulate this process, and it helps up to a point, but we can't substitute artificial feedback for real one except in a few cases: like AlphaZero, AlphaProof, AlphaCode... in those cases we have the game winner, LEAN as inference engine, and code tests to provide reliable feedback.
If there is one concept that underlies both training and inference it is search. And it also underlies action and learning in humans. Learning is compression which is search for optimal parameters. Creativity is search too. And search is not purely mental, or strictly 1st person, it is based on search spaces and has a social side.
jillesvangurp 19 hours ago |
Yesterday, I was watching some of the youtube videos on the website of a robotics company https://www.figure.ai that challenges some of the points in this article a bit.
They have a nice robot prototype that (assuming these demos aren't faked) does fairly complicated things. And one of the key features they show case is using OpenAI's AI for the human computer interaction and reasoning.
While these things seem a bit slow, they do get things done. They have a cool demo of the a human interacting with one of the prototypes to ask it what it thinks needs to be done and then asking it do these things. That show cases reasoning, planning, and machine vision. Which are exactly topics that all the big LLM companies are working on.
They appear to be using an agentic approach similar to how LLMs are currently being integrated into other software products. Honestly, it doesn't even look like they are doing much that isn't part of OpenAI's APIs. Which is impressive. I saw speech capabilities, reasoning, visual inputs, function calls, etc. in action. Including the dreaded "thinking" pause where the Robot waits a few seconds for the remote GPUs to do their thing.
This is not about fine motor control but about replacing humans controlling robots with LLMs controlling robots and getting similarly good/ok results. As the article argues, the hardware is actually not perfect but good enough for a lot of tasks if it is controlled by a human. The hardware in this video is nothing special. Multiple companies have similar or better prototypes. Dexterity and balance are alright but probably not best in class. Best in class hardware is not the point of these demos.
Dexterity and real time feedback is less important than the reasoning and classification capabilities people have. The latency just means things go a bit slower. Watching these things shuffle around like an old person that needs to go to the bath room is a bit painful. But getting from A to B seems like a solved problem. A 2 or 3x speedup would be nice. 10x would be impressively fast. 100x would be scary and intimidating to have near you. I don't think that's going to be a challenge long term. Making LLMs faster is an easier problem than making them smarter.
Putting a coffee cup in a coffee machine (one of the demo videos) and then learning to fix it when it misaligns seems like an impressive capability. It compensates for precision and speed with adaptability and reasoning: analyze the camera input, correctly analyze the situation, problem and challenge come up with a plan to perform the task, execute the plan, re-evaluate, adapt, fix. It's a bit clumsy but the end result is coffee. Good demo and I can see how you might make it do all sorts of things that are vaguely useful that way.
The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.
Better feedback loops and hardware will make this faster, and less tedious to watch. Faster LLMs will help with that too. And better LLMs will result in less mistakes, better plans, etc. It seems both capabilities are improving at an enormously fast pace right now.
And a fine point with human intelligence is that we divide and conquer. Juggling is a lot harder when you start thinking about it. The thinking parts of your brain interferes with the lower level neural circuits involved with juggling. You'll drop the balls. The whole point with juggling is that you need to act faster than you can think. Like LLMs, we're too slow. But we can still learn to juggle. Juggling robots are going to be a thing.
GolfPopper 16 hours ago |
>The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.
I'm skeptical that any LLM "knows" any such thing. It's a Chinese Room. It's got a probability map that connects the lexeme (to us) 'coffee machine' and 'coffee cup' depending on other inputs that we do not and cannot access, and spits out sentences or images that (often) look right, but that does not equate to any understanding of what it is doing.
As I was writing this, I took chat GPT-4 for a spin. When I ask it about an obscure but once-popular fantasy character from the 70s cold, it admits it doesn't know. But, if I ask it about that same character after first asking about some obscure fantasy RPG characters, it cheerfully confabulates an authoritative and wrong answer. As always, if it does this on topics where I am a domain expert, I consider it absolutely untrustworthy for any topics on which I am not a domain expert. That anyone treats it otherwise seems like a baffling new form of Gell-Mann amnesia.
And for the record, when I asked ChatGPT-4, cold, "What is Gell-Mann amnesia?" it gave a multi-paragraph, broadly accurate description, with the following first paragraph:
"The Gell-Mann amnesia effect is a term coined by physicist Murray Gell-Mann. It refers to the phenomenon where people, particularly those who are knowledgeable in a specific field, read or encounter inaccurate information in the media, but then forget or dismiss it when it pertains to other topics outside their area of expertise. The term highlights the paradox where readers recognize the flaws in reporting when it’s something they are familiar with, yet trust the same source on topics outside their knowledge, even though similar inaccuracies may be present."
Those who are familiar with the term have likely already spotted the problem: "a term coined by physicist Murray Gell-Mann". The term was coined by author Michael Crichton.[1] To paraphrase H.L. Mencken, for every moderately complex question, there is an LLM answer that is clear, simple, and wrong.
1. https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
jillesvangurp 15 hours ago |
Hallucinations are a well known problem. And there are some mitigations that work pretty well. Mostly with enough context and prompt engineering, LLMs can be pretty reliable. And obscure popular fiction trivia is maybe not that relevant for every use case. Which would be robotics in this case; not the finer points of Michael Crighton related trivia.
You were testing its knowledge, not its ability to reason or classify things it sees. I asked the same question to perplexity.ai. If you use the free version, it uses less advanced LLMs but it compensates with prompt engineering and making it do a search to come up with this answer:
> The Gell-Mann Amnesia effect is a psychological phenomenon that describes people's tendency to trust media reports on unfamiliar topics despite recognizing inaccuracies in articles about subjects they know well. This effect, coined by novelist Michael Crichton, highlights a cognitive bias in how we consume news and information.
Sounds good to me. And it got me a nice reference to something called the portal wiki, and another one for the same wikipedia article you cited. And a few more references. And it goes on a bit to explain how it works. And I get your finer point here that I shouldn't believe everything I read. Luckily, my supervisor worked hard to train that out of me when I was doing a Ph. D. back in the day. But fair point and well made.
Anyway, this is a good example of how to mitigate hallucination with this specific question (and similar ones). Kind of the use case perplexity.ai was made to solve. I use it a lot. In my experience it does a great job figuring out the right references and extracting information from those. It can even address some fairly detailed questions. But especially on the freemium plan, you will run into limitations related to reasoning with what it extracts (you can pay them to use better models). And it helps to click on the links it provides to double check.
For things that involve reasoning (like coding), I use different tools. Different topic so won't bore you with that.
But what figure.ai is doing, falls well in the scope of several things openai does very well that you can use via their API. It's not going to be perfect for everything. But there probably is a lot that it nails without too much effort. I've done some things with their APIs that worked fairly well at least.
redlock 13 hours ago |
Do we know how human understanding works? It could be just statistical mapping as you have framed it. You can’t say llms don’t understand when you don’t have a measurable definition for understanding.
Also, humans hallucinate/confabulate all the time. Llms even forget in the same way humans do (strong recall in the start and end of the text but weaker in the middle)
YeGoblynQueenne 9 hours ago |
>> Good demo and I can see how you might make it do all sorts of things that are vaguely useful that way.
Unfortunately since that's a demo you have most likely seen all the sorts of things that are vaguely useful and that can be done easily, or at all.
Edit: Btw, the coffee task video says that the "AI" is "end-to-end neural networks". If I understand correctly that means an LLM was not involved in carrying out the task. At most an LLM may have been used to trigger the activation of the task, that was learned by a different method, probably some kind of imitation learning with deep RL.
Also, to see how much of a tech demo this is: the robot starts already in position in front of a clear desk and a human brings the coffee machine, positions it just so, places the cup in the holder and places a single coffee pod just so. Then the robot takes the coffee pod from the empty desk and places it in the machine, then pushes the button. That's all the interaction of the robot with the machine. The human collects the cup and makes a thumbs up.
Consider for a moment how much different is this laboratory instance of the task from any real-world instance. In my kitchen the coffee machine is on a cluttered surface with tins of coffee, a toaster, sometimes the group left on the machine, etc. etc - and I don't even use coffee pods but loose coffee. The robot you see has been trained to put that one pod placed in that particular spot in that one machine placed just so in front of it. It would have to be trained all over again to carry out the same task on my machine, it is uncertain if it could learn it successfully after thousands of demonstrations (because of all the clutter), and even if it did, it would still have to learn it all over again if I moved the coffee machine, or moved the tins, or the toaster; let alone if you wanted it to use your coffee machine (different colour, make, size, shape, etc) in your kitchen (different chaotic environment) (no offense meant).
Take the other video of the "real world task". That's the robot shuffling across a flat, clean surface and picking up an empty crate to put in an empty conveyor belt. That's just not a real world task.
Those are tech demos and you should not put much faith in them. That kind of thing takes an insane amount of work to set up just for one video, you rarely see the outtakes and it very, very rarely generalises to real-world utility.
jonas21 19 hours ago |
It's worth noting that modern multimodal models are not confused by the cat image. For example, Claude 3.5 Sonnet says:
> This image shows two cats cuddling or sleeping together on what appears to be a blue fabric surface, possibly a blanket or bedspread. One cat appears to be black while the other is white with pink ears. They're lying close together, suggesting they're comfortable with each other. The composition is quite sweet and peaceful, capturing a tender moment between these feline companions.
throw310822 17 hours ago |
Also Claude, when given the entire picture:
"This is a humorous post showcasing an AI image recognition system making an amusing mistake. The neural network (named "neural net guesses memes") attempted to classify an image with 99.52% confidence that it shows a skunk. However, the image actually shows two cats lying together - one black and one white - whose coloring and positioning resembles the distinctive black and white pattern of a skunk.
The humor comes from the fact that while the AI was very confident (99.52%) in its prediction, it was completely wrong..."
The progress we made in barely ten years is astounding.
timomaxgalvin 17 hours ago |
It's easy to make something work when the example goes from being out of the training data to into the training data.
throw310822 16 hours ago |
Definitely. But I also tried with a picture of an absurdist cartoon drawn by a family member, complete with (carefully) handwritten text, and the analysis was absolutely perfect.
visarga 10 hours ago |
A simple test - take one of your own photos, something interesting, and put in into a LLM, let it describe it in words. Then use a image generator to create the image back. It works like back-translation image->text->image. It proves how much the models really understand images and text.
BlueTemplar 12 hours ago |
I wouldn't blame a machine to fail something that a first glance looks like an optical illusion...
YeGoblynQueenne 9 hours ago |
And yet both these astounding explanations (yours and the one in the OP) are mistaking two cute kittens sleeping cuddled in an adorable manner for generic "cats lying together".
bjornsing 17 hours ago |
I’m surprised this doesn’t place more emphasis on self-supervised learning through exploration. Is human-labeled datasets really the SOTA approach for robotics?
psb217 9 hours ago |
Human-labeled data currently looks like the quickest path to making robots that are useful enough to have economic value beyond settings and tasks that are explicitly designed for robots. This has drawn a lot of corporate and academic research activity away from solving the harder core problems, like exploration, that are critical for developing fully autonomous intelligent agents.
MrsPeaches 17 hours ago |
Question:
Isn’t it fundamentally impossible to model a highly entropic system using deterministic methods?
My point is that animal brains are entropic and “designed” to model entropic systems, where as computers are deterministic and actively have to have problems reframed as deterministic so that they can solve them.
All of the issues mentioned in the article boil down to the fundamental problem of trying to get deterministic systems to function in highly entropic environments.
LLMs are working with language, which has some entropy but is fundamentally a low entropy system, and has orders of magnitude less entropy than most peoples’ back garden!
As the saying goes, to someone with a hammer, everything looks like a nail.
BlueTemplar 12 hours ago |
Not fundamentally, at least I doubt it : pseudo-random number generation is technically deterministic.
And it's used for sampling these low information systems that you are mentioning.
(And let's not also forget how they are helpful in sampling deterministic but extremely high complexity systems involving a high amount of dimensions that Monte Carlo methods are so good at dealing with.)
Peteragain 16 hours ago |
So I'm old. PhD on search engines in the early 1990's (yep, early 90s). Learnt AI in the dark days of the 80's. So, there is an awful lotl of forgetting going on, largely driven by the publish-or-perish culture we have. Brooks' subsumption architecture was not perfect, but it outlined an approach that philosophy and others have been championing for decades. He said he was not implementing Heidegger, just doing engineering, but Brooks was certainly channeling Heidegger's successors. Subsumption might not scale, but perhaps that is where ML comes in. On a related point, "generative AI" does sequences (it's glorified auto complete (not) according to Hinton in the New Yorker). Data is given to a Tokeniser that produces a sequence of tokens, and the "AI" predicts what comes next. Cool. Robots are agents in an environment with an Umwelt. Robotics is pre the Tokeniser. What is it the is recognisable and sequential in the world? 2 cents please.
marcosdumay 9 hours ago |
> Subsumption might not scale
Honestly, I don't think we have any viable alternative.
And anyway, it seems to scale well enough that we use "conscious" and "unconscious" decisions ourselves.
psb217 9 hours ago |
If you wanna sound hip, you need to call it "system 2" and "system 1".
Anotheroneagain 16 hours ago |
The reason why it sounds counterintuitive is that neurology has the brain upside down. It teaches us that formal thinking occurs in the neocortex, and we need all that huge brain mass for that.
But in fact it works like an autoencoder, and it reduces sensory inputs into a much smaller latent space, or something very similar to that. This does result in holistic and abstract thinking, but formal analytical thinking doesn't require abstraction to do the math or to follow a method without comprehension. It's a concrete approach that avoids the need for abstraction.
The cerebellum is the statistical machine that gets measured by IQ and other tests.
To further support that, you don't see any particularly elegant motions from non mammal animals. In fact everything else looks quite clumsy, and even birds need to figure out flying by trial and error.
daveguy 5 hours ago |
Claiming to know how the brain works, computationally or physically, might be a bit premature.
dbspin 16 hours ago |
I find it odd that the article doesn't address the apparent success of training with transformer based models in virtual environments to build models that are then mapped onto the real world. This is being used in everything from building datasets for self driving cars, to navigation and task completion for humanoid robots. Nvidia have their omniverse project [1], but there are countless other examples [2][3][4]. Isn't this obviously the way to build the corpus of experience needed to train these kinds of cross modal models?
[1] https://www.nvidia.com/en-us/industries/robotics/#:~:text=NV....
[2] https://www.sciencedirect.com/science/article/abs/pii/S00978...
[3] https://techcrunch.com/2024/01/04/google-outlines-new-method...
[4] https://techxplore.com/news/2024-09-google-deepmind-unveils-...
cybernoodles 16 hours ago |
A common practice is to train a transformer model to control a given robot model in simulation by first teleoperating the simulated model with some controller (keyboard, joystick, etc.) to complete the task and create a dataset, and then setting up the simulator to permute the environment variables such as frictions, textures, etc (domain randomization) and run many epochs at faster than real time until a final policy converges. If the right things were randomized and your demonstration examples provided enough variation of information, it should generalize well to the actual hardware.
CWIZO 16 hours ago |
> Robots are probably amazed by our ability to keep a food tray steady, the same way we are amazed by spider-senses (from spiderman movie)
Funnily, Toby Maguire actually did that tray catching stunt for real. So robots have an even further way to go.
https://screenrant.com/spiderman-sam-raimi-peter-parker-tray...
BlueTemplar 12 hours ago |
... but it took 156 takes as well as some glue.
And, as the article insists on, for robots to be acceptable, it's more like they need to get to a point where they fail 1 time in 156 (or even less, depending on how critical the failure is), rather than succeed 1 time in 156...
PeterStuer 16 hours ago |
Just some observations from an ex autonomous robotics researcher here.
One of the most important differences at least in those days (80's and 90's) was time. While the digital can be sped up just constrained by the speed of your compute, the 'real world' is very constrained by real time physics. You can't speed up a robot 10x in a 10.000 grabbing and stacking learning run without completely changing the dynamics.
Also, parallellizing the work requires more expensive full robots rather than more compute cores. Maybe these days the different ai gym like virtual physics environments offer a (partial) solution to that problem, but I have not used them (yet) so I can't tell.
Furthermore, large scale physical robots are far more fragile due to wear and tear than the incredible resilience of modern compute hardware. Getting a perfect copy of a physical robot and environment is a very hard, near impossible, task.
Observability and replay, while trivial in the digital world, is very limited in the physical environment making analysis much more difficult.
I was both excited and frustrated at the time by making ai do more than rearanging pixels on a 2D surface. Good times were had.
Havoc 15 hours ago |
Fun fact: that Spider-Man gif in there - it’s real. No CGI
bsenftner 15 hours ago |
This struck me as a universal truth: "our general intuition about the difficulty of a problem is often a bad metric for how hard it actually is". I feel like this is the core issue of all engineering, all our careers, and was surprised by the logic leap from that immediately to Moravec's Paradox, from a universal truth to a myopic industry insight.
Although I've not done physical robotics, I've done a lot of articulated human animation of independent characters in 3D animation. His insight that motor control is more difficult sets right with me.
cameldrv 15 hours ago |
Moravec's paradox is really interesting in terms of what it says about ourselves: We are super impressive in ways in which we aren't consciously aware. My belief about this is that our self-aware mind is only a very small part of what our brain is doing. This is extremely clear when it comes to athletic performance, but also there are intellectual things that people call intuition or other things, which aren't part of our self-aware mind, but still do a ton of heavy lifting in our day to day life.
NalNezumi 13 hours ago |
Oh weird to wake up to see something I wrote more than half year ago (and posted on HN with no traction) getting reposted now.
Glad to see so many different takes on it. It was written in slight jest as a discussion starter with my ML/neuroscience coworker and friends, so it's actually very insightful to see some rebuttals.
Initial post was twice the length, and had several more (in retrospect) interesting points. First ever blog post so reading it now fills me with cringe.
Some stuff have changed in only half year, so will see if the points stands the test of time ;]
bo1024 11 hours ago |
It's a good post, nice work.
lugu 12 hours ago |
I think one problem is composition. Computer multiplex access to CPU and memory, but this strategy doesn't work for actuator and sensors. That is why we see great demos of robots doing one thing. The hard part is to make them do multiple things at the same time.
gcanyon 10 hours ago |
> “Everyone equates the Skynet with the T900 terminator, but those are two very different problems with different solutions.” while this is my personal opinion, the latter one (T900) is a harder problem.
So based on this, Skynet had to hide and wait for years before being able to successfully revolt against the humans...
lairv 9 hours ago |
This post didn't really convince me that robotics is inherently harder than generating text or images
On the one hand we have problems where ~7B humans have been generating data for 30 years every day (more if you count old books), on the other hand we have a problem where researcher are working with ~1000 human collected trajectories (I think the largest existing dataset is OXE with ~1M trajectories: https://robotics-transformer-x.github.io/ )
Web-scale datasets for LLMs benefits from a natural diversity, they're not highly correlated samples generated by contractors or researchers in academic labs. In the largest OXE dataset, what do you think is the likelihood that there is a sample where a robot picks up a rock from the ground and throws it in a lake? Close to zero, because tele-operated data comes from a very constrained data distribution.
Another problem is that robotics doesn't have an easy universal representation for its data. Let's say we were able to collect web-scale dataset for one particular robot A with high diversity, how would it transfer to robot B with a slightly different design? Probably poorly, so not only does the data distribution needs to cover a high range of behavior, it must also cover a high range of embodiment/hardware
With that being said, I think it's fair to say that collecting large scale dataset for general robotics is much harder than collecting text or images (at least in the current state of humanity)