Underwater Sonar Image Classification Based on Vision Transformers and Metric Learning
Underwater sonar image classification is essential for maritime surveillance, autonomous navigation, and underwater target identification, where optical sensing is often restricted by turbidity and light attenuation. To enhance the robustness of sonar-based perception under such challenging conditions, this study proposes a metric-enhanced Vision Transformer (ViT) framework that integrates Siamese-based representation alignment with distance-regularized classification. In the first stage, a Siamese pre-training strategy aligns the embeddings of positive pairs, encouraging directionally consistent representations that improve class separability even under severe noise and viewpoint variation. In the second stage, the pretrained ViT encoder is frozen, and five classifiers (Linear, Cosine, and Proxy heads, together with Mahalanobis-regularized variants of the latter two) are systematically evaluated to investigate the effect of embedding normalization and distributional alignment. Experimental results on the UATD dataset demonstrate that the Siamese-trained ViT produces more stable and discriminative features than both ResNet-50 and a standard ViT-S. Among the classifiers, the Mahalanobis-regularized cosine classifier achieves the highest accuracy, with markedly fewer misclassifications between visually similar classes such as cube and square cage. Overall, the proposed approach highlights the effectiveness of combining ViT with metric learning and covariance-aware distance normalization for underwater sonar image recognition. The results suggest that metric-enhanced transformers offer a robust and generalizable foundation for sonar-based perception in real maritime environments.
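To make the first-stage objective concrete, the following is a minimal PyTorch sketch of a positive-pair alignment loss. The abstract describes pulling embeddings of positive pairs toward directional consistency but does not give the exact loss, so a negative cosine similarity between the two views is assumed here; the names `vit`, `view1`, and `view2` are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def siamese_alignment_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between embeddings of a positive pair.

    z1, z2: (batch, dim) embeddings of two views of the same sonar image.
    Minimizing this pulls each pair toward the same direction on the unit
    hypersphere, i.e., directionally consistent representations.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    return -(z1 * z2).sum(dim=-1).mean()

# Hypothetical training step: `vit` is any encoder mapping images to
# (batch, dim) embeddings; `view1`/`view2` are two augmented views of a batch.
# loss = siamese_alignment_loss(vit(view1), vit(view2))
```

A pure alignment loss like this typically relies on the pairing scheme (or an asymmetric predictor, as in SimSiam-style setups) to avoid representational collapse; the sketch shows only the alignment term the abstract describes.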
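The abstract also does not spell out the exact form of the Mahalanobis regularization, so the sketch below shows one plausible reading: embeddings are whitened with the inverse square root of a feature covariance estimated on the training set, and a scaled cosine head is applied on top. The class `MahalanobisCosineClassifier` and the parameters `cov`, `scale`, and `eps` are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MahalanobisCosineClassifier(nn.Module):
    """Cosine classifier on Mahalanobis-whitened embeddings (a sketch).

    Features are first whitened with Sigma^{-1/2}, where Sigma is a fixed
    feature covariance estimated on the training set, so the cosine score
    between a feature and a class weight behaves like a covariance-aware
    (Mahalanobis-style) angular distance.
    """

    def __init__(self, dim: int, num_classes: int, cov: torch.Tensor,
                 scale: float = 16.0, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim) * 0.01)
        self.scale = scale
        # Sigma^{-1/2} via eigendecomposition of the regularized covariance.
        eigval, eigvec = torch.linalg.eigh(cov + eps * torch.eye(dim))
        whiten = eigvec @ torch.diag(eigval.clamp_min(eps).rsqrt()) @ eigvec.T
        self.register_buffer("whiten", whiten)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x @ self.whiten.T                  # Mahalanobis whitening
        x = F.normalize(x, dim=-1)             # unit-norm features
        w = F.normalize(self.weight, dim=-1)   # unit-norm class weights
        return self.scale * (x @ w.T)          # scaled cosine logits
```

The resulting logits can be trained with standard cross-entropy on top of the frozen ViT features; `scale` plays the usual temperature role of cosine classifiers, and the whitening step is what distinguishes this head from a plain cosine classifier.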