In real life, you'd use these models for synthetic depth of field, adding fake bokeh to a very sharp image that's in focus everywhere. So this seems too easy?
Impressive latency tho.
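For concreteness, a minimal sketch of what that fake-bokeh compositing might look like once you have a depth map (the focus depth, feather width, and blur size below are made-up parameters, not anything from the model; assumes float image/depth arrays):

    import numpy as np
    import cv2  # any Gaussian blur will do; OpenCV used here for brevity

    def fake_bokeh(image, depth, focus_depth, feather, blur_ksize=21):
        """Composite a blurred copy over the sharp image, weighted by how
        far each pixel's depth is from the chosen focal plane."""
        blurred = cv2.GaussianBlur(image, (blur_ksize, blur_ksize), 0)
        # 0 at the focal plane, ramping up to 1 outside the focus band
        weight = np.clip(np.abs(depth - focus_depth) / feather, 0.0, 1.0)
        weight = weight[..., None]  # broadcast over color channels
        return (1.0 - weight) * image + weight * blurred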
https://youtu.be/pLfCdI0mjkI?si=8K7rPHu558P-Hf-Z
I assume the first pass is the depth inference here.
Of course, lots of text-to-image models generate a mess, because their training sets are highly contaminated by the messes produced by “Portrait mode”.
I'd bet that if you did that on these examples you'd see that the hair, rather than being attached to the animal, is floating halfway between the animal and the background. Of course, depth mapping is an ill-posed problem. The hair is not completely opaque and the pixels in that region have contributions from both the hair and the background, so the neural net is doing the best it can. To really handle hair correctly you would have to output a list of depths (and colors) per pixel, rather than a single depth, so pixels with contributions from multiple objects could be accurately represented.
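Something like a layered depth image would do that: each pixel stores a short list of (depth, color, alpha) contributions instead of one depth. A toy sketch of the idea (purely illustrative; none of the models discussed here output this):

    from dataclasses import dataclass

    @dataclass
    class DepthSample:
        depth: float                       # distance to this contribution
        color: tuple[float, float, float]  # (r, g, b) of this contribution
        alpha: float                       # fraction of the pixel it covers

    # One list of samples per pixel, so a strand of hair and the
    # background behind it can both be represented at that pixel.
    LayeredDepthImage = list[list[list[DepthSample]]]  # [row][col][layer]

    def composite(samples: list[DepthSample]) -> tuple[float, float, float]:
        """Front-to-back alpha compositing of one pixel's layers."""
        out, remaining = [0.0, 0.0, 0.0], 1.0
        for s in sorted(samples, key=lambda s: s.depth):  # nearest first
            for c in range(3):
                out[c] += remaining * s.alpha * s.color[c]
            remaining *= 1.0 - s.alpha
        return (out[0], out[1], out[2])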
They do this. See figure 4 in the paper. Are the results cherry-picked to look good? Probably. But so is everything else.
We plug depth maps produced by Depth Pro, Marigold, Depth Anything v2, and Metric3D v2 into a recent publicly available novel view synthesis system.
We demonstrate results on images from AM-2k. Depth Pro produces sharper and more accurate depth maps, yielding cleaner synthesized views. Depth Anything v2 and Metric3D v2 suffer from misalignment between the input images and estimated depth maps, resulting in foreground pixels bleeding into the background.
Marigold is considerably slower than Depth Pro and produces less accurate boundaries, yielding artifacts in synthesized images.
You're correct about that, but for something like a matte/depth-threshold that's exactly what you want to get a smooth and controllable transition within the limited amount of resolution you have. For that use case, especially with the fuzzy hair, it's pretty good.
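In practice that matte is just a soft threshold on the depth map, something like this (a toy sketch; the threshold and feather values are arbitrary):

    import numpy as np

    def depth_matte(depth, threshold, feather):
        """Soft foreground matte: ~1 in front of the threshold, ~0 behind it,
        with a smooth ramp roughly `feather` depth-units wide."""
        t = np.clip((threshold - depth) / feather + 0.5, 0.0, 1.0)
        return t * t * (3.0 - 2.0 * t)  # smoothstep, so the transition is gentle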
Each pixel takes multiple measurements over time of the intensity of reflected light that matches the emission pulse encodings. The result is essentially a vector of intensity over a set of distances.
A low depth resolution example of reflected intensity by time (distance):
    i: _ _ ^ _ ^ - _ _
    d: 0 1 2 3 4 5 6 7
In the above example, the pixel would exhibit an ambiguity between distances of 2 and 4.
The simplest solution is to select the weighted-average or median distance, which results in "flying pixels" or "mixed pixels"; efficient techniques already exist for filtering those out. The bottom line is that for applications like low-latency obstacle detection on a cost-constrained mobile robot, some compression of the depth information is required.
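A sketch of those simple strategies (not any particular sensor's pipeline; `histograms` is assumed to be an (H, W, num_bins) array of return intensity per distance bin):

    import numpy as np
    from scipy.ndimage import median_filter

    def depth_from_histograms(histograms, bin_distances, mode="weighted"):
        """Collapse each pixel's intensity-vs-distance histogram to one depth."""
        if mode == "peak":
            # strongest return wins; an ambiguous pixel just picks one peak
            return bin_distances[np.argmax(histograms, axis=-1)]
        # weighted average: an ambiguous pixel lands between its two returns,
        # which is exactly the "flying pixel" case described above
        total = np.maximum(histograms.sum(axis=-1, keepdims=True), 1e-9)
        return (histograms / total * bin_distances).sum(axis=-1)

    def filter_flying_pixels(depth, window=3, max_dev=0.2):
        """Replace pixels whose depth strays too far from the local median."""
        med = median_filter(depth, size=window)
        return np.where(np.abs(depth - med) > max_dev, med, depth)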
For the sake of inferring a highly realistic model from an image, neural radiance fields or Gaussian splats may best generate the representation you might be envisioning, with a volumetric representation of material properties like hair. This comes with higher compute costs, however, and doesn't factor in semantic interpretation of a scene. The top-performing results in photogrammetry have tended to use a combination of less expensive techniques like this one to better handle sparse scene coverage, then refine the result using more expensive techniques [1].
So this just makes it easier to swap it out without making any other changes.
https://i.pinimg.com/originals/00/f4/8c/00f48c6b443c0ce14b51...
?
Plausibly, you can train a model that encodes enough information about a specific set of imager+lens combinations that the lens-distortion behavior of images captured with those imagers and lenses provides what's needed to resolve the scale of objects. But that is a much weaker claim than the one monodepth researchers generally make.
Two notable cases where something like monodepth does reliably work are actually ones where considerably more information is present: animal eyes have substantial information about focus available (and eyes are nothing like a planar imager), and phase-detection autofocus uses an entirely different kind of data (phase offsets via special lenses) than monodepth models do (and, arguably, is mostly a relative, incremental process rather than something that produces absolute depth).