In real life, you'd use these models for synthetic depth of field, adding fake bokeh to a very sharp image that's in focus everywhere. So this seems too easy?
Impressive latency tho.
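For concreteness, a minimal sketch of what that fake-bokeh compositing might look like once you have a depth map (the focus depth, feather width, and blur size below are made-up parameters, not anything from the model; assumes float image/depth arrays):

    import numpy as np
    import cv2  # any Gaussian blur will do; OpenCV used here for brevity

    def fake_bokeh(image, depth, focus_depth, feather, blur_ksize=21):
        """Composite a blurred copy over the sharp image, weighted by how
        far each pixel's depth is from the chosen focal plane."""
        blurred = cv2.GaussianBlur(image, (blur_ksize, blur_ksize), 0)
        # 0 at the focal plane, ramping up to 1 outside the focus band
        weight = np.clip(np.abs(depth - focus_depth) / feather, 0.0, 1.0)
        weight = weight[..., None]  # broadcast over color channels
        return (1.0 - weight) * image + weight * blurred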
https://youtu.be/pLfCdI0mjkI?si=8K7rPHu558P-Hf-Z
I assume the first pass is the depth inference here.
Of course, lots of text-to-image models generate a mess, because their training sets are highly contaminated by the messes produced by “Portrait mode”.
I'd bet that if you did that on these examples you'd see that the hair, rather than being attached to the animal, is floating halfway between the animal and the background. Of course, depth mapping is an ill-posed problem. The hair is not completely opaque and the pixels in that region have contributions from both the hair and the background, so the neural net is doing the best it can. To really handle hair correctly you would have to output a list of depths (and colors) per pixel, rather than a single depth, so pixels with contributions from multiple objects could be accurately represented.
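Something like a layered depth image would do that: each pixel stores a short list of (depth, color, alpha) contributions instead of one depth. A toy sketch of the idea (purely illustrative; none of the models discussed here output this):

    from dataclasses import dataclass

    @dataclass
    class DepthSample:
        depth: float                       # distance to this contribution
        color: tuple[float, float, float]  # (r, g, b) of this contribution
        alpha: float                       # fraction of the pixel it covers

    # One list of samples per pixel, so a strand of hair and the
    # background behind it can both be represented at that pixel.
    LayeredDepthImage = list[list[list[DepthSample]]]  # [row][col][layer]

    def composite(samples: list[DepthSample]) -> tuple[float, float, float]:
        """Front-to-back alpha compositing of one pixel's layers."""
        out, remaining = [0.0, 0.0, 0.0], 1.0
        for s in sorted(samples, key=lambda s: s.depth):  # nearest first
            for c in range(3):
                out[c] += remaining * s.alpha * s.color[c]
            remaining *= 1.0 - s.alpha
        return (out[0], out[1], out[2])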
They do this. See figure 4 in the paper. Are the results cherry-picked to look good? Probably. But so is everything else.
We plug depth maps produced by Depth Pro, Marigold, Depth Anything v2, and Metric3D v2 into a recent publicly available novel view synthesis system.
We demonstrate results on images from AM-2k. Depth Pro produces sharper and more accurate depth maps, yielding cleaner synthesized views. Depth Anything v2 and Metric3D v2 suffer from misalignment between the input images and estimated depth maps, resulting in foreground pixels bleeding into the background.
Marigold is considerably slower than Depth Pro and produces less accurate boundaries, yielding artifacts in synthesized images.
You're correct about that, but for something like a matte/depth-threshold that's exactly what you want to get a smooth and controllable transition within the limited amount of resolution you have. For that use case, especially with the fuzzy hair, it's pretty good.
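In practice that matte is just a soft threshold on the depth map, something like this (a toy sketch; the threshold and feather values are arbitrary):

    import numpy as np

    def depth_matte(depth, threshold, feather):
        """Soft foreground matte: ~1 in front of the threshold, ~0 behind it,
        with a smooth ramp roughly `feather` depth-units wide."""
        t = np.clip((threshold - depth) / feather + 0.5, 0.0, 1.0)
        return t * t * (3.0 - 2.0 * t)  # smoothstep, so the transition is gentle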
Each pixel takes multiple measurements over time of the intensity of reflected light that matches the emission pulse encodings. The result is essentially a vector of intensity over a set of distances.
A low depth resolution example of reflected intensity by time (distance):
    i: _ _ ^ _ ^ - _ _
    d: 0 1 2 3 4 5 6 7
In the above example, the pixel would exhibit an ambiguity between distances of 2 and 4.
The simplest solution is to select the weighted-average or median distance, which results in "flying pixels" or "mixed pixels"; efficient techniques already exist for filtering those out. The bottom line is that for applications like low-latency obstacle detection on a cost-constrained mobile robot, some compression of the depth information is required.
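A sketch of those simple strategies (not any particular sensor's pipeline; `histograms` is assumed to be an (H, W, num_bins) array of return intensity per distance bin):

    import numpy as np
    from scipy.ndimage import median_filter

    def depth_from_histograms(histograms, bin_distances, mode="weighted"):
        """Collapse each pixel's intensity-vs-distance histogram to one depth."""
        if mode == "peak":
            # strongest return wins; an ambiguous pixel just picks one peak
            return bin_distances[np.argmax(histograms, axis=-1)]
        # weighted average: an ambiguous pixel lands between its two returns,
        # which is exactly the "flying pixel" case described above
        total = np.maximum(histograms.sum(axis=-1, keepdims=True), 1e-9)
        return (histograms / total * bin_distances).sum(axis=-1)

    def filter_flying_pixels(depth, window=3, max_dev=0.2):
        """Replace pixels whose depth strays too far from the local median."""
        med = median_filter(depth, size=window)
        return np.where(np.abs(depth - med) > max_dev, med, depth)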
For the sake of inferring a highly realistic model from an image, neural radiance fields or Gaussian splats may best generate the representation you might be envisioning, with a volumetric representation of material properties like hair. This comes with higher compute costs, however, and doesn't factor in semantic interpretation of a scene. The top-performing results in photogrammetry have tended to use a combination of less expensive techniques like this one to better handle sparse scene coverage, then refine the result using more expensive techniques [1].
So this just makes it easier to swap it out without making any other changes.
https://i.pinimg.com/originals/00/f4/8c/00f48c6b443c0ce14b51...
?
Plausibly, you can train a model that encodes enough information about a specific set of imager+lens combinations that the lens-distortion behavior of images captured with those imagers and lenses provides what's needed to resolve the scale of objects. But that is a much weaker claim than the one monodepth researchers generally make.
Two notable cases where something like monodepth does reliably work are actually ones where considerably more information is present: animal eyes have substantial information about focus available (and eyes are nothing like a planar imager), and phase-detection autofocus uses an entirely different kind of data (phase offsets via special lenses) than monodepth models do (and, arguably, is mostly a relative, incremental process rather than something that produces absolute depth).