Abstract:
In recent years, self-supervised monocular depth estimation methods have achieved impressive progress. However, their performance degrades significantly when generating structured depth maps in complex indoor scenes. To bridge this gap, we focus on the training process and propose LoFtDepth, a novel method that combines self-supervised monocular depth estimation with local-feature-guided knowledge distillation. First, an off-the-shelf depth estimation network generates structured relative depth maps as depth priors. Local features are then extracted from these priors as boundary points to guide local depth refinement, which reduces interference from depth-irrelevant features and transfers the boundary knowledge of the depth priors to the self-supervised depth estimation network. In addition, we introduce an inverse auto-mask weighted surface normal loss, which encourages the normal directions of depth maps predicted by the self-supervised network to align with those of the depth priors in untextured regions, thereby improving depth estimation accuracy. Finally, exploiting the coherence of camera motion, we impose a pose consistency constraint on residual pose estimation. This constraint enables effective adaptation to indoor scenes, where camera poses change frequently, thereby mitigating training errors and boosting model performance. Extensive experiments on major indoor datasets demonstrate that LoFtDepth outperforms previous methods, reducing the absolute relative error to 0.121 and generating accurate, well-structured depth maps.
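
To make the surface normal alignment idea concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it back-projects a depth map to 3D points with pinhole intrinsics, estimates normals from cross products of finite differences, and penalizes misalignment between predicted and prior normals under an inverse auto-mask weight. The function names (`depth_to_normals`, `normal_alignment_loss`), the construction of `inv_automask`, and the exact loss form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def depth_to_normals(depth, K):
    """Back-project a depth map (B,1,H,W) to camera-frame 3D points and
    estimate per-pixel surface normals via the cross product of spatial
    point differences. K is a 3x3 pinhole intrinsics matrix."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=depth.dtype),
        torch.arange(W, device=depth.device, dtype=depth.dtype),
        indexing="ij",
    )
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    Z = depth
    X = (xs - cx) / fx * Z
    Y = (ys - cy) / fy * Z
    points = torch.cat([X, Y, Z], dim=1)               # (B,3,H,W)
    # Finite differences along image axes, cropped to a common size
    dx = points[:, :, :-1, 1:] - points[:, :, :-1, :-1]  # (B,3,H-1,W-1)
    dy = points[:, :, 1:, :-1] - points[:, :, :-1, :-1]  # (B,3,H-1,W-1)
    normals = torch.cross(dx, dy, dim=1)
    return F.normalize(normals, dim=1)                  # unit normals

def normal_alignment_loss(pred_depth, prior_depth, K, inv_automask):
    """Penalize misalignment between normals of the predicted depth and the
    depth prior. inv_automask (B,1,H,W) is assumed to weight untextured
    regions more heavily, e.g. the complement of a photometric auto-mask."""
    n_pred = depth_to_normals(pred_depth, K)
    n_prior = depth_to_normals(prior_depth, K)
    cos = (n_pred * n_prior).sum(dim=1, keepdim=True)   # cosine similarity
    w = inv_automask[:, :, :-1, :-1]                    # match cropped size
    return (w * (1.0 - cos)).mean()
```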