With the spread of computer vision technology in agriculture, securing high-quality training data has become essential, yet conventional manual dataset construction is limited by its heavy time and cost requirements. This study therefore developed a semi-automatic annotation system based on SAM3 (Segment Anything Model 3), a recent multimodal foundation model. The proposed system consists of three stages, implemented as a GUI: (1) text-prompt-based object recognition, (2) SAM3-based precise mask generation and conversion to trainable polygon coordinates, and (3) user verification. In an evaluation on 600 images, SAM3 achieved a matching rate of 92.9% and a mean average precision (mAP) of 0.790, reducing dataset construction time by 96-98% compared with manual labeling. This substantially outperforms existing foundation-model pipelines such as SAM+CLIP and Grounding DINO+SAM in both accuracy and efficiency. By exploiting the zero-shot capability of foundation models, this study is expected to improve the efficiency of agricultural data labeling and to help accelerate related AI research.
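Stage (2) of the pipeline above turns a model-generated mask into trainable label coordinates. The following is a minimal illustrative sketch, not the paper's implementation: it assumes the segmentation model returns a binary NumPy mask and reduces it to an axis-aligned bounding polygon (a real system would trace the mask contour for a tighter polygon).

```python
import numpy as np

def mask_to_bbox_polygon(mask: np.ndarray):
    """Reduce a binary mask (H, W) to an axis-aligned polygon label.

    Simplified sketch of the mask -> trainable-coordinates step: here we
    emit only the bounding rectangle's four corners; a production system
    would trace and simplify the actual contour.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: nothing to label
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    # Clockwise polygon: top-left, top-right, bottom-right, bottom-left
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

# Example: a foreground region spanning rows 10-29, columns 20-49
mask = np.zeros((64, 64), dtype=np.uint8)
mask[10:30, 20:50] = 1
poly = mask_to_bbox_polygon(mask)  # [(20, 10), (49, 10), (49, 29), (20, 29)]
```

The resulting coordinate list can be serialized directly into common annotation formats such as COCO-style polygons.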
A robot usually adopts ANN (artificial neural network)-based object detection and instance segmentation algorithms to recognize objects, but creating datasets for these algorithms incurs high labeling costs because the datasets must be labeled manually. To lower this cost, a new scheme is proposed that automatically generates training images and labels them for specific objects. The scheme uses an instance segmentation algorithm trained to produce masks of unknown objects, so that such masks can be obtained in a simple environment. The RGB images of the objects are extracted using these masks, and only the object classes need to be labeled under human supervision. The extracted object images are then synthesized with various background images to create new training images. Labeling of the synthesized images is performed automatically using the masks and the previously entered object classes. In addition, human intervention is further reduced by using a robot arm to collect the object images. Experiments show that an instance segmentation model trained with the proposed method performs on par with one trained on a real dataset, and that the time required to generate the dataset is significantly reduced.
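The synthesis-and-auto-labeling step described above can be sketched as follows. This is a hypothetical minimal example, not the authors' code: a masked object crop is pasted onto a background, and the shifted mask, together with the previously entered class name, becomes the instance label with no manual annotation.

```python
import numpy as np

def composite_and_label(obj_rgb, obj_mask, background, top_left, cls):
    """Paste a masked object crop onto a background image and return the
    synthesized image plus an automatically generated label.

    obj_rgb:   (h, w, 3) object crop
    obj_mask:  (h, w) binary mask selecting the object's pixels
    top_left:  (row, col) paste position in the background
    cls:       class name supplied earlier by the human supervisor
    """
    out = background.copy()
    h, w = obj_mask.shape
    y, x = top_left
    sel = obj_mask.astype(bool)
    # Overwrite only the object's pixels; background shows through elsewhere
    region = out[y:y + h, x:x + w]
    region[sel] = obj_rgb[sel]
    # The mask, shifted into the background frame, is the instance label
    full_mask = np.zeros(background.shape[:2], dtype=np.uint8)
    full_mask[y:y + h, x:x + w] = obj_mask
    return out, {"class": cls, "mask": full_mask}

# Usage: paste an 8x8 object crop onto a 32x32 background at (5, 5)
obj = np.full((8, 8, 3), 200, dtype=np.uint8)
obj_mask = np.zeros((8, 8), dtype=np.uint8)
obj_mask[2:6, 2:6] = 1
bg = np.zeros((32, 32, 3), dtype=np.uint8)
image, label = composite_and_label(obj, obj_mask, bg, (5, 5), "cup")
```

Repeating this over many backgrounds and paste positions yields an arbitrarily large labeled dataset from a handful of captured object masks.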
We present a region-based approach for accurate pose estimation of small mechanical components. Our algorithm consists of two key phases: multi-view object co-segmentation and pose estimation. In the first phase, we describe an automatic method to extract binary masks of a target object captured from multiple viewpoints. For initialization, we assume the target object is bounded by a convex volume of interest defined by a few user inputs. The co-segmented target object shares the same geometric representation in space and has color models distinct from those of the backgrounds. In the second phase, we retrieve a 3D model instance with the correct upright orientation and estimate the relative pose of the object observed in the images. Our energy function, combining region and boundary terms for the proposed measures, maximizes the overlap of regions and boundaries between the multi-view co-segmentations and the projected masks of the reference model. Based on high-quality co-segmentations consistent across all viewpoints, our final results are accurate model indices and pose parameters for the extracted object. We demonstrate the effectiveness of the proposed method on various examples.
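The region-plus-boundary energy described above can be illustrated with a simplified sketch. This is not the paper's formulation: here the region term is the IoU of the co-segmentation and the projected model mask, the boundary term is the overlap of their 4-neighbour edge bands, and the weights are illustrative placeholders.

```python
import numpy as np

def region_boundary_energy(seg_mask, proj_mask, w_region=0.7, w_boundary=0.3):
    """Score agreement between a co-segmentation mask and a projected
    model mask, combining a region term and a boundary term.

    Hedged sketch: region = IoU of the two masks; boundary = IoU of their
    inner edge pixels. Weights are assumptions, not from the paper.
    """
    a = seg_mask.astype(bool)
    b = proj_mask.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    region = inter / union if union else 0.0

    def edges(m):
        # Mark pixels whose 4-neighbourhood crosses the mask boundary
        e = np.zeros_like(m)
        e[:-1] |= m[:-1] ^ m[1:]
        e[1:] |= m[1:] ^ m[:-1]
        e[:, :-1] |= m[:, :-1] ^ m[:, 1:]
        e[:, 1:] |= m[:, 1:] ^ m[:, :-1]
        return e & m  # keep only boundary pixels inside the mask

    ea, eb = edges(a), edges(b)
    denom = np.logical_or(ea, eb).sum()
    boundary = np.logical_and(ea, eb).sum() / denom if denom else 0.0
    return w_region * region + w_boundary * boundary
```

Maximizing such a score over candidate model indices and pose parameters, across all viewpoints, mirrors the search the second phase performs.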