Lotus: Diffusion-Based Visual Foundation Model for High-Quality Dense Prediction
46 points
1 day ago
| 2 comments
| lotus3d.github.io
curvilinear_m
23 hours ago
Can someone more knowledgeable than me help me understand a few points about this article?

It claims to be diffusion-based, but the two main differences from an approach like Stable Diffusion are that (1) they use only a single denoising step instead of the traditional ~1000, and (2) they directly predict the value z^y instead of a noise direction. According to their analyses, both of these changes help on the studied tasks. But isn't that how supervised learning has always worked? Aside from using a larger model, this isn't very different from "traditional" depth estimators that don't claim anything to do with diffusion.
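A toy sketch may make this point concrete (all numbers are hypothetical, and this is not the Lotus code): once the timestep is fixed and the network predicts the clean latent directly, the diffusion objective collapses into ordinary supervised regression, since the eps-parameterization differs only by an affine change of variables.

```python
import math

# Toy illustration, NOT the Lotus implementation: with a single fixed
# timestep, the eps- and x0-parameterizations are related by an affine
# map, so predicting the clean latent directly is plain regression.
alpha_bar = 0.5  # hypothetical cumulative noise level at the chosen step

def forward_diffuse(z0, eps):
    """Standard DDPM forward process at one fixed timestep."""
    return math.sqrt(alpha_bar) * z0 + math.sqrt(1 - alpha_bar) * eps

def recover_z0_from_eps(z_t, eps_pred):
    """eps-parameterization: recover the clean latent from predicted noise."""
    return (z_t - math.sqrt(1 - alpha_bar) * eps_pred) / math.sqrt(alpha_bar)

# x0-parameterization skips the recovery step: the network's output IS z0.
z0, eps = 2.0, -0.3
z_t = forward_diffuse(z0, eps)
assert abs(recover_z0_from_eps(z_t, eps) - z0) < 1e-9
```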

It also claims zero-shot abilities, but they fine-tune the denoising model f_theta on a concatenation with the latent image and apply a loss against the latent label. So their evaluation datasets may be out-of-distribution, but I don't understand how that's zero-shot. Asking ChatGPT to output a depth estimate of a given image would be zero-shot, because it hasn't been trained to do that (to my knowledge).
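As I read the setup described here, the training step looks like ordinary supervised fine-tuning; a minimal sketch under that reading (identifiers like f_theta, z_x, z_y mirror the comment; everything else is my assumption, not the official repo):

```python
# Hypothetical sketch of the described fine-tuning step: the model input
# is the image latent concatenated with a noised label latent, and the
# loss is taken against the clean label latent z_y.

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def training_loss(f_theta, z_x, z_y, noise):
    z_y_noisy = [y + n for y, n in zip(z_y, noise)]  # single-step noising
    z_pred = f_theta(z_x + z_y_noisy)                # channel-wise concat
    return mse(z_pred, z_y)                          # supervise on z_y

# An oracle model that outputs the clean label latent achieves zero loss,
# i.e. the objective is a plain regression onto z_y:
z_y = [0.1, -0.2]
oracle = lambda latents: z_y  # stand-in for a trained network
loss = training_loss(oracle, [1.0, 2.0], z_y, [0.05, -0.05])
```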

thot_experiment
1 day ago
Very cool stuff! I got it running on Win10 with minimal effort, and the results are really impressive. I've been working on some plotter art where I use normal data to guide stroke orientation. In the past I've mostly worked with 3D scenes, where the normal data is free, but I'm excited to try working with photos using this tool.
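For anyone curious how a normal map can guide stroke orientation, here is one common recipe (my interpretation, not necessarily this commenter's pipeline): take the normal's in-plane component and hatch perpendicular to that tilt direction, so strokes appear to wrap around the surface.

```python
import math

# Hypothetical sketch: map a unit surface normal (nx, ny, nz) to a hatch
# stroke angle in the image plane. Strokes run perpendicular to the
# normal's in-plane tilt, which is a common choice for hatching.

def stroke_angle(nx, ny, nz):
    """Angle (radians) of a hatch stroke for a unit surface normal."""
    if abs(nx) < 1e-9 and abs(ny) < 1e-9:
        return 0.0  # surface faces the camera: orientation is arbitrary
    return math.atan2(ny, nx) + math.pi / 2  # perpendicular to the tilt

# A surface tilting toward +x gets vertical strokes (angle = pi/2):
angle = stroke_angle(1.0, 0.0, 0.0)
```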
reply