We’re just getting started with 3D, and the incentives for it to improve are strong
BUT, you can turn around and see something that I presume was entirely generated. So I don't think it is just doing some clever tricks to make the photo look 3D; it's also "infilling" what is behind the camera. That is kinda cool.
I'd love to see this improved so I can walk around some more though, to see what is down those alleys etc.
They're a really cool way to capture spatial memories though! Friends and I occasionally use Polycam or Luma Labs apps. But there's not _too much_ you can do with them due to the above limitations.
From a brief look at the OP link, World Labs seems to be generating a 360º gaussian splat (for a limited view box) from a still photo, which is cool as hell! But we still have the same problem of "what do we do with gaussian splats".
[1] This description is hand-wavey as I'm a relative layman when it comes to how these work. I'm sure someone can reply with a more precise answer if this one is bad.
Fei-Fei Li is a luminary in the field, and she's assembled a stellar team of some of the best researchers in the space.
Their gamble is that they'll be able to move faster than the open research and other companies looking to productionize that research.
Time will tell if this can become an ElevenLabs or if it'll fizzle out like Character.ai.
My worry is that without a product, they'll malinvest their research into cool problems that don't satisfy market demand. There's nothing like the north star of customers. They'll also have a tough time with hiring going forward at that valuation.
The market of open research and models is producing a lot of neat stuff in 3D. But there's no open pool of data yet, despite HuggingFace and others trying.
We'll see what happens.
What would have to be true for this to be worth $100 billion after 5-10 years?
That money will go fast though, given GPU costs and salary ranges in the Bay Area.
Before: You and your cofounders share 100% of the stock in your company that's valued at $X.
After: Your company now has everything it had before plus $Y worth of "something" from the VCs. Your company is now valued at $X plus $Y. The VCs now hold stock in your company worth $Y. You and your cofounders still hold stock in your company worth $X.
"Something" might be anything. Cash, stocks, commitments to resources, whatever.
Appears to have a similar capability to the OP for creating scene depth from 2D, but it uses a point cloud instead of gaussian splatting for rendering, so it looks more pixelated: https://github.com/akbartus/360-Depth-in-WebXR
Also, unlike the World Labs example, you have the ability to go further outside the bounds of the point cloud to inspect the deficiencies of the approach. It's getting there but still needs work.
I do think there is the possibility of eventually using something like this to do all the processing in the browser, with Depth Anywhere plus splat reconstruction to fill in the holes of the current point cloud approach: https://github.com/ArthurBrussee/brush
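For anyone curious what the point cloud half of that looks like, here's a minimal sketch of unprojecting a depth map into a point cloud assuming a plain pinhole camera; the linked repo works with 360º equirectangular depth, so its projection math differs:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Unproject a per-pixel depth map (H, W) into an (N, 3) point cloud,
    assuming a simple pinhole camera with focal lengths fx, fy and
    principal point (cx, cy)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a fake 4x4 depth map, 1 unit away everywhere.
points = depth_to_point_cloud(np.ones((4, 4)), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(points.shape)  # (16, 3)
```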
I get these are early steps, but they oversold it.
“We are hard at work improving the size and fidelity of our generated worlds”
I imagine the further you move from the input image, the more the model has to make up and the harder it is to keep it consistent. Video generation has a similar problem.
Which is the same thing as saying this may turn out to be a dud, like so many other things in tech and the current crop of what we’re calling AI.
Like I said, I get this is an early demo, but don’t oversell it. They could’ve started by being honest and clarifying they’re generating scenes (or whatever you want to call them, but they’re definitely not “worlds”), letting you play a bit, then explain the potential and progress. As it is, it just sounds like they want to immediately wow people with a fantasy and it detracts from what they do have.
What previous work are you referring to?
You can imagine this as a spectrum. On the one end you have models that, at each output pixel, try to predict pixels that are locally similar to ones in the previous frame; on the other end, you could imagine models that "parse" the initial input image to understand the scene - objects (buildings, doors, people, etc.) and their relationships, and separately, the style with which they're painted, and use that to extrapolate further frames[0]. The latter would obviously fare better, remaining stylistically consistent for longer.
(This model claims to be of the second kind.)
The way I see it: a human could do it[1], so there's no reason an ML model wouldn't be able to.
--
[0] - Brute-force approach: 1) "style-untransfer" the input, i.e. style-transfer it to some common style, e.g. photorealistic or sketch; 2) extrapolate the style-untransferred image; and 3) style-transfer the result back using the original input as the style reference (rough sketch below). Feels like it should work somewhat okay-ish; I wonder if anyone has tried that.
[1] - And the hard part wouldn't be extrapolating the scene, but rather keeping the style.
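Roughly what I mean in [0], with the style-transfer and outpainting models left as placeholder callables (nothing here is a real library API, just the three steps named):

```python
def extrapolate_with_style_roundtrip(input_image, style_transfer, outpaint,
                                     neutral_style="photorealistic"):
    """Brute-force sketch from [0]. `style_transfer` and `outpaint` stand in
    for whatever models you'd actually use; both are hypothetical here."""
    # 1) "Style-untransfer": re-render the input in a common, neutral style
    #    where the extrapolation model has more training data to lean on.
    neutral = style_transfer(input_image, style=neutral_style)
    # 2) Extrapolate/outpaint the scene in that neutral style.
    extended = outpaint(neutral)
    # 3) Transfer the result back, using the original input as the style
    #    reference, so the new regions match the original look.
    return style_transfer(extended, style=input_image)
```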
It breaks down pretty quickly once you get outside the default bounds, as expected, though.
Generate incrementally using a pathfinding system for a bot to move around and "create the world" as it goes, as if a Google Street View car followed the philosophy of George Berkeley.
Except this time there is no underlying consistent world, so it would be up to the algorithm to use dead reckoning or some kind of coordinate system to recognize that you're approaching a place you've "been" before, and incorporate whatever you found there into the new scenes it produces.
I suppose this is the easy part, actually; for me the real trouble might be collision based on the non-deterministic thing that was generated, i.e. how to decide which scene edges the player should be able to travel through, interact with, be stopped by, burned by, etc.
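Something like a chunk cache keyed by a coarse world grid is probably the minimum viable version of the "dead reckoning" part. A sketch, with the actual generation model left as a hypothetical callable:

```python
from typing import Callable, Dict, Tuple

Coord = Tuple[int, int]

class LazyWorld:
    """Minimal sketch: key generated chunks by a quantized world coordinate so
    that revisiting a location reuses (and conditions on) what was generated
    there before. `generate_chunk` is a hypothetical model call."""

    def __init__(self, generate_chunk: Callable[[Coord, dict], dict], chunk_size: float = 10.0):
        self.generate_chunk = generate_chunk
        self.chunk_size = chunk_size
        self.chunks: Dict[Coord, dict] = {}  # coordinate -> generated scene data

    def _key(self, x: float, y: float) -> Coord:
        # The "coordinate system" that lets us recognize we're approaching
        # a place we've "been" before.
        return (int(x // self.chunk_size), int(y // self.chunk_size))

    def chunk_at(self, x: float, y: float) -> dict:
        key = self._key(x, y)
        if key not in self.chunks:
            # Pass already-generated neighbours as conditioning so the new
            # chunk stays consistent with whatever was found there before.
            neighbours = {k: v for k, v in self.chunks.items()
                          if abs(k[0] - key[0]) <= 1 and abs(k[1] - key[1]) <= 1}
            self.chunks[key] = self.generate_chunk(key, neighbours)
        return self.chunks[key]
```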
Consistent inconsistency gets old very very fast.
Best case: roguelike adventure.
But generally just phantasmagoria.
I know you didn’t mean it like this, but this is kind of an insult to the insane amounts of work that go into crafting just the RNG systems behind roguelikes.
Certainly has some value to it: marketing, hiring, fundraising (assuming it's a private company).
My take is that it's a good start, and 3-4 years from now it will have a lot of potential value in world creation if they can make the next steps.
The risk is setting expectations that can't be fulfilled.
I'm in the 3D space and I'm optimistic about World Labs.
So I'm willing to accept the limitation, and at this point we know that this can only get better. Next I thought about the likelihood of Nvidia releasing an AI game engine, or more of a renderer, fully AI based. It should be happening within the next 10 years.
Imagine creating a game by describing scenes, like the ones in the article, with a good morphing technology between scenes, so that the transitions between them are like auto-generated scenes which are just as playable.
The effects shown in the article were very interesting, like the ripple, sonar or wave. The wave made me think about how trippy games could get in the future, more extreme versions of the Subnautica video [0] which was released last month.
We could generate video games which would periodically slip into hallucinations, a thing that is barely doable today, akin to shader effects in Far Cry or other games when the player gets poisoned.
A Fiebertraum ("fever dream") engine.
> at this point we know that this can only get better.
We don’t know that. It will probably get better, but will it be better enough? No one knows.
> It should be happening within the next 10 years.
Every revolution in tech is always ten years away. By now that’s a meme. Saying something is ten years away is about as valuable as saying one has no idea how doable it is.
> Imagine
Yes, I understand the goal. Everyone does, it’s not complicated. We can all imagine Star Trek technology, we all know where the compass is pointed, that doesn’t make it a given.
In fact, the one thing we can say for sure about imagining how everything will be great in ten years is that we routinely fail to predict the bad parts. We don’t live in fantasy land, advancements in tech are routinely used for detrimental reasons.
We might all be dead in 10 years, but with big tech companies making their plays, all the VC money flowing in to new startups, and nuclear plants being brought online to power the next base model training runs, there's room for a little mild entertainment like these sorts of gimmicks in the next 3 years or so. I doubt anything that comes of it will top even my top 15 video games list though.
That’s a contender for the most depressing tradeoff ever. “Yeah, we’ll all die in agony way before our time, but at least we got to play with a neat but ultimately underwhelming tool for a bit”.
Why do you think this game would be good? I'm not a game maker, but the visual layer is not the reason people like or enjoy a game (e.g. Nintendo). There are teams of professionals making games today that range from awful to great. I get that there are indie games made by a single person that will benefit from generated graphics, but asset creation seems to be a really small part of it.
Let me elaborate by using cat-4d.github.io, one of their competitors in this field of research: If you look at the "How it works" section you can see that the first step is to take an input video and then create artificial viewpoints of the same action being observed by other cameras. And then in the 2nd step, those viewpoints are merged into one 4D gaussian splatting scene. That 2nd step is pretty similar to 4D NeRF training, BTW, just with a different data format.
Now if you need a small scene, you generate a few camera locations that are nearby. But there's nothing stopping you from generating other camera locations, or even from reusing previously generated camera locations and moving the camera again, thereby propagating details that the AI invented outwards. So you could imagine it like this: you start with something "real" at the center of the map, then you create AI fakes from different camera positions in a circle around the real stuff, then the next circle around the first-generation fakes, and the next circle, and so on. This process is mostly limited by two things. The first is the ability of your AI model to capture a global theme; World Labs has demonstrated that they can invent details consistent with a theme in these demos, so I would assume they have solved this already. The other limit is computing time: a world box 2x in each direction is 8x the voxel data, and I wouldn't be surprised if you need something like 16x to 32x the number of input images to fit the GSplats/NeRF.
So most likely, the box limit is purely because the AI model is slow and execution is expensive and they didn't want to spend 10,000x the resources for making the box 10x larger.
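A toy illustration of that ring-by-ring expansion (not anything World Labs or CAT4D have published, just the scheme described above):

```python
import math

def ring_cameras(num_rings=3, cams_per_ring=8, radius_step=2.0):
    """Camera positions in concentric circles around the 'real' content at the
    origin; generation would proceed ring by ring, each conditioned on the
    previous one."""
    rings = []
    for r in range(1, num_rings + 1):
        radius = r * radius_step
        ring = [(radius * math.cos(2 * math.pi * i / cams_per_ring),
                 radius * math.sin(2 * math.pi * i / cams_per_ring))
                for i in range(cams_per_ring)]
        rings.append(ring)
    return rings

# Cost grows cubically with box size: doubling the box in each dimension
# gives 2**3 = 8x the voxel data, before counting the extra input images.
print([len(r) for r in ring_cameras()])
```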
They most likely will have to pivot a few times but once they show their LLM solving problems that others can't, the others will quickly add these features too. Right now, it is cheaper to wait for World Labs to go first. The others are not that far behind: https://cat-4d.github.io/
Something like this applied to every frame of the movie would allow you to move around a little and preserve the perspective shifts. The limitation that you can only move about 4 feet in any direction would not matter for this use case.
Of course this comes at the expense of the director and cinematographer's intention, which is no small thing.
Your point is completely correct. Even Apple's awesome new stereoscopic 3D short film for the AVP immediately loses what could be its total awesomeness from this basic fact. With the perspective perfectly fixed, it will never quite fool brains so used to dealing with these micro-movements.
Maybe this could be done with just the aperture and focal distance, which most modern cinema cameras record as they film.
Is XYZ diffusion-transformers? Or is XYZ Chameleon? Or some novel architecture?
It takes the absolute fastest teams, it seems, 7 months to develop a first version of a model. And it also seems that models are like babies, 9 moms do not produce a model in 1 month.
The tough thing is that it may be possible to develop a great video model with DiTs for $220m; or it may be possible to develop a great video model with Chameleon for $1b; but if it's 3D + time, will it be too expensive for them to do?
The craziest thing to me is that these guys are super talented, but they might not have enough money!
(I'm part of World Labs)
The "Step into Paintings" section cracked me up. As soon as you pan away from the source material, the craziness of the model is on full display. So sure, I can experience iconic pieces of art in a new way, it's just not a good experience.
Can you use this "3D world" with blender, unity or whatever else? Can you even do anything remotely useful with it?
Splats are very new and there are still many things to figure out: relighting, animation, interactivity.
It's more hype-cycle nonsense. But they'll pour billions into this rather than pay human artists what they're worth.
After seeing what they are showing on their page, I am majorly impressed and don't feel they are misleading anyone at all.
https://www.youtube.com/watch?v=7SVD_tLGAJk
The environment is 2.5D, so mostly a displaced depth map, but with some additional logic to keep the ground and horizon sane.
We aren't planning to support it for external use, but it does create a glb that you can in theory download if you know where it points to on our server.
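For anyone who wants to play with the displaced-depth-map idea themselves, a rough sketch using numpy and trimesh; this is not their actual pipeline (which also has the ground/horizon logic mentioned above):

```python
import numpy as np
import trimesh

def depth_to_glb(depth, path="environment.glb", scale=1.0):
    """Displace a regular grid by a depth map and export it as a glb.
    Just the bare 2.5D idea, no ground/horizon handling."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    vertices = np.stack([xs, ys, depth * scale], axis=-1).reshape(-1, 3)

    faces = []
    for i in range(h - 1):
        for j in range(w - 1):
            a = i * w + j
            faces.append([a, a + 1, a + w])          # two triangles per grid cell
            faces.append([a + 1, a + w + 1, a + w])
    trimesh.Trimesh(vertices=vertices, faces=faces).export(path)

depth_to_glb(np.random.rand(64, 64))  # fake depth map, just to exercise the code
```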
I maintain an open source framework called A-Frame for Web-based 3D, VR, and AR, and the community would love something like that.
Keep up the good work
We used Godot in the past, back when the product was still called VRWorkout, but had to switch to Unity due to business reasons.
The environment creator uses several off-the-shelf models under the hood, with custom LoRAs, and Blender at the end to create the exportable meshes.
Users usually need to work out in the game to earn coins to generate environments, because we have no actual monetization behind it, so we can't have people generate endless amounts of environments. But if you want to try it out, send me a message at michael -at- xrworkout.io and I'll set you up so you can try it.
So the headline should be: "Generate 3D worlds from a single image that we rendered ourselves and used to train our model."
Maybe when there is better technology for viewing 3D content.
I would guess they are pre-generating the world from the image rather than generating it on the go, since it runs reasonably well, but doesn't this really limit world size?
I noticed some solid geometry that is accidentally transparent.
The stuff behind the camera, which is presumably fully generated, looks pretty good, so if they can make it so you can actually move around more, with similar quality, it could be interesting.
I do wish the examples had a full screen button; the view is tiny.
It won't be Hollywood quality. It'll either look like animation or early mixed CGI/live action stuff with "wooden" performances, etc., but it will let them see their work acted out which will be super cool.
Obviously the pro version of this, with detailed editing and incorporating real human actors for the starring roles, will be what's used to make a lot of real film and TV series content.