A Comparison of U-Net-Based Architectures for Monocular Depth Estimation

Silva, Antônio Carlos Durães da (2024-11-19)

Master's dissertation

Monocular depth estimation is a fundamental task in computer vision, with applications in areas such as augmented reality, autonomous navigation, and medical procedures. This work investigates the reuse of architectures originally developed for semantic segmentation, such as U-Net and its variants, for depth estimation. Combinations of the U-Net and UNet++ architectures with different encoders were implemented and evaluated, covering convolutional networks, Transformer networks, and hybrid architectures, including VGG-19 (BN), Xception, Inception-ResNet-v2, Mixed Transformer (B2), CoaT, CoAtNet, and TransUNet. The experiments were conducted on the NYU Depth V2 dataset, using multiple input sizes to investigate the impact of resolution on the results. The evaluation metrics are RMSE, rel, log10, and the accuracy thresholds δ1, δ2, and δ3. The results show that the combination of U-Net with the CoaT-Lite (M) encoder outperforms all other evaluated combinations. The implementation is available at: https://github.com/duraes-antonio/seg_depth.

Keywords: Monocular depth estimation. U-Net. UNet++. Transformers.
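
The abstract names the metrics RMSE, rel, log10, and δ1 to δ3 without defining them. The sketch below shows how these metrics are conventionally computed in the monocular depth estimation literature: rel as the mean absolute relative error, log10 as the mean absolute difference of base-10 logarithms, and δi as the fraction of pixels whose ratio max(d/d*, d*/d) is below 1.25^i. The function name `depth_metrics` and the flat-list input format are assumptions made for illustration; the dissertation's own implementation may differ, for example in how invalid pixels are masked or how depth is capped.

```python
import math


def depth_metrics(pred, target):
    """Compute common depth metrics over paired, strictly positive depths.

    `pred` and `target` are flat sequences of floats (hypothetical input
    format chosen for this sketch; real pipelines use image tensors).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")

    n = len(pred)
    sq_err = 0.0      # accumulates squared error for RMSE
    abs_rel = 0.0     # accumulates |d - d*| / d* for rel
    log10_err = 0.0   # accumulates |log10 d - log10 d*| for log10
    hits = [0, 0, 0]  # pixel counts within 1.25, 1.25**2, 1.25**3

    for p, t in zip(pred, target):
        if p <= 0 or t <= 0:
            raise ValueError("depths must be strictly positive")
        sq_err += (p - t) ** 2
        abs_rel += abs(p - t) / t
        log10_err += abs(math.log10(p) - math.log10(t))
        ratio = max(p / t, t / p)
        for i in range(3):
            if ratio < 1.25 ** (i + 1):
                hits[i] += 1

    return {
        "rmse": math.sqrt(sq_err / n),
        "rel": abs_rel / n,
        "log10": log10_err / n,
        "delta1": hits[0] / n,
        "delta2": hits[1] / n,
        "delta3": hits[2] / n,
    }
```

Lower values are better for RMSE, rel, and log10, while higher values are better for the δ thresholds. A prediction of 3.0 m against a ground truth of 2.0 m, for instance, has a ratio of 1.5, so it misses δ1 (threshold 1.25) but counts toward δ2 (threshold 1.5625) and δ3.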

