Estimating the depth of objects from a single image is a valuable task for many vision, robotics, and graphics applications. However, current methods often fail to produce accurate depth for objects in diverse scenes. In this work, we propose a simple yet effective Background Prompting strategy that adapts the input object image with a learned background. We learn the background prompts using only small-scale synthetic object datasets. To infer object depth on a real image, we place the segmented object into the learned background prompt and run off-the-shelf depth networks. Background Prompting helps the depth networks focus on the foreground object, because it makes them invariant to background variations. Moreover, Background Prompting minimizes the domain gap between synthetic and real object images, leading to better sim2real generalization than simple fine-tuning. Results on multiple synthetic and real datasets demonstrate consistent improvements in real object depths for a variety of existing depth networks.

Method overview


We learn the background prompts using a small dataset of synthetic objects (Amazon Berkeley Objects, ABO), which we render using the default hyperparameters found in the original ABO dataset and HM3D-ABO.
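The inference step described above, placing the segmented object into the learned background before running a depth network, can be sketched as a simple alpha composite. This is a minimal illustration rather than the authors' implementation: the function name `composite_on_background`, the array shapes, and the value range are assumptions, and the depth network itself is left out.

```python
import numpy as np

def composite_on_background(image, mask, background):
    """Paste the segmented object from `image` onto `background`.

    image:      (H, W, 3) float array in [0, 1], the real input image
    mask:       (H, W) bool or float array, 1 where the object is
    background: (H, W, 3) float array in [0, 1], the learned prompt
    (Shapes and value range are assumptions made for this sketch.)
    """
    alpha = np.asarray(mask, dtype=np.float32)[..., None]  # (H, W, 1)
    # Object pixels come from the image, everything else from the prompt.
    return alpha * image + (1.0 - alpha) * background

# Toy example: a 2x2 "object" in the middle of a 4x4 image.
image = np.full((4, 4, 3), 0.8, dtype=np.float32)
background = np.full((4, 4, 3), 0.2, dtype=np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

prompted = composite_on_background(image, mask, background)
# `prompted` would then be passed, unchanged, to an off-the-shelf depth network.
```

Because only the input pixels change, the depth network's weights stay frozen at inference time.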

Example results

Background Parameterization

We propose parameterizing the prompts in Fourier space instead of pixel space. As pointed out in previous works, this parameterization creates different basins of attraction than pixel-space parameterization, which results in better generalization.
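As a rough illustration of what a Fourier-space parameterization can look like, the sketch below stores the background as complex frequency coefficients and decodes it to pixels with an inverse FFT. The function name `background_from_fourier`, the half-spectrum layout, and the sigmoid squashing to [0, 1] are all assumptions made for this example, not details taken from the paper. In practice the coefficients would be optimized with an automatic differentiation framework; NumPy is used here only to show the forward decode.

```python
import numpy as np

def background_from_fourier(coeffs_real, coeffs_imag, height, width):
    """Decode a background image from Fourier coefficients.

    coeffs_real, coeffs_imag: (3, height, width // 2 + 1) arrays holding the
    real and imaginary parts of each colour channel's half-spectrum. These
    are the quantities that would be learned instead of raw pixels.
    """
    spectrum = coeffs_real + 1j * coeffs_imag
    # Inverse real FFT over the two spatial axes gives a (3, H, W) image.
    pixels = np.fft.irfft2(spectrum, s=(height, width), axes=(-2, -1))
    # Squash to (0, 1); the sigmoid is an assumption of this sketch.
    pixels = 1.0 / (1.0 + np.exp(-pixels))
    return np.transpose(pixels, (1, 2, 0))  # (H, W, 3)

H, W = 8, 8
rng = np.random.default_rng(0)
real = rng.normal(size=(3, H, W // 2 + 1))
imag = rng.normal(size=(3, H, W // 2 + 1))
bg = background_from_fourier(real, imag, H, W)

# All-zero coefficients decode to a flat mid-grey background.
flat = background_from_fourier(np.zeros_like(real), np.zeros_like(imag), H, W)
```

Each coefficient affects every pixel, so a gradient step in this space changes the whole image at once, which is one intuition for why the optimization landscape differs from the pixel-space one.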

In-distribution and Out-of-distribution performance

Tables 1 and 2 show in-distribution (validation set of ABO and HM3D) and out-of-distribution (samples from datasets) performance, respectively. Our prompting strategy outperforms the default off-the-shelf models while modifying only the input pixels. Furthermore, our background prompting strategy achieves good results compared to fine-tuning, which requires modifying all the parameters of the network.

Inferred Backgrounds

When only the background prompts are fed to the networks (without the foreground object being inpainted), the inferred depths are similar across networks. When the backgrounds are produced by conditioning on semantic masks (PNet), the inferred depths resemble a box that follows the Manhattan grid orientation of the masks.


