Face-related applications such as face recognition, facial expression recognition, driver state monitoring, and gaze estimation are among the most frequently performed tasks in human-robot interaction (HRI), intelligent vehicles, and security systems. In these applications, accurate head pose estimation is an important issue. However, conventional methods have lacked accuracy, robustness, or processing speed in practical use. In this paper, we propose a novel method for estimating head pose with a monocular camera. The proposed algorithm is based on a deep neural network for multi-task learning using a small grayscale image. This network jointly detects multi-view faces and estimates head pose under hard environmental conditions such as illumination change and large pose change. The proposed framework quantitatively and qualitatively outperforms the state-of-the-art method, with an average head pose mean error of less than 4.5° in real time.
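As an illustration of how a head pose mean error such as the one quoted above could be computed, the following is a minimal sketch. It assumes each pose is given as (yaw, pitch, roll) in degrees and that the reported figure is a mean absolute error averaged over the three axes; the abstract does not state the exact metric, and the function names are hypothetical.

```python
def angle_diff_deg(a, b):
    """Signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def mean_pose_error(predicted, ground_truth):
    """Mean absolute angular error per axis and averaged over axes.

    Each pose is a (yaw, pitch, roll) tuple in degrees (an assumed layout).
    Returns (per_axis_errors, overall_error).
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    per_axis = [
        sum(abs(angle_diff_deg(p[i], g[i])) for p, g in zip(predicted, ground_truth)) / n
        for i in range(3)
    ]
    return per_axis, sum(per_axis) / 3.0


# Hypothetical example values, not data from the paper.
predicted = [(10.0, 0.0, 0.0), (179.0, 5.0, -5.0)]
ground_truth = [(12.0, 0.0, 0.0), (-179.0, 0.0, 0.0)]
per_axis, overall = mean_pose_error(predicted, ground_truth)
```

The wrap-around in `angle_diff_deg` keeps a prediction of 179° against a ground truth of -179° from being counted as a 358° error.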
We present a region-based approach for accurate pose estimation of small mechanical components. Our algorithm consists of two key phases: multi-view object co-segmentation and pose estimation. In the first phase, we describe an automatic method to extract binary masks of a target object captured from multiple viewpoints. For initialization, we assume the target object is bounded by a convex volume of interest defined by a few user inputs. The co-segmented target object shares the same geometric representation in space, and has color models distinct from those of the backgrounds. In the second phase, we retrieve a 3D model instance with the correct upright orientation, and estimate the relative pose of the object observed in the images. Our energy function, combining region and boundary terms for the proposed measures, maximizes the overlapping regions and boundaries between the multi-view co-segmentations and the projected masks of the reference model. Based on high-quality co-segmentations that are consistent across all viewpoints, our final results are accurate model indices and pose parameters of the extracted object. We demonstrate the effectiveness of the proposed method using various examples.
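The region part of such an overlap measure can be sketched as follows. This is only an assumed form: it scores each view by the intersection-over-union (IoU) between the co-segmented mask and the projected model mask and sums 1 - IoU over the views, so that a lower value means better overlap. The abstract does not give the actual energy function, the boundary term is omitted, and the function names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def region_energy(segmentations, projections):
    """Sum over views of (1 - IoU); lower values mean better overlap."""
    return sum(1.0 - mask_iou(s, p) for s, p in zip(segmentations, projections))


# Two hypothetical 2x2 views: the first overlaps partially, the second exactly.
segmentations = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
projections = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
energy = region_energy(segmentations, projections)
```

A pose search would evaluate this score for the masks projected under each candidate pose and keep the candidate with the lowest value.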
Facial feature extraction and tracking are essential steps in human-robot interaction (HRI) tasks such as face recognition, gaze estimation, and emotion recognition. The active shape model (ASM) is one of the successful generative models for extracting facial features. However, applying ASM alone is not adequate for modeling a face in actual applications, because the positions of facial features are extracted unstably due to the limited number of iterations in the ASM fitting algorithm. The inaccurate positions of facial features decrease the performance of emotion recognition. In this paper, we propose a real-time facial feature extraction and tracking framework using ASM and LK optical flow for emotion recognition. LK optical flow is well suited to estimating time-varying geometric parameters in sequential face images. In addition, we introduce a straightforward method to avoid tracking failure caused by partial occlusions, which can be a serious problem for tracking-based algorithms. Emotion recognition experiments with k-NN and SVM classifiers show over 95% classification accuracy for three emotions: "joy", "anger", and "disgust".
This paper studies how to combine devices such as monocular/stereo cameras, pan/tilt motors, fisheye lenses, and convex mirrors in order to solve vision-based robotic problems. To overcome the well-known trade-offs between optical properties, we present two new mixed systems. The first system is a robot photographer with a conventional pan/tilt perspective camera and a fisheye lens. The second system is an omnidirectional detector for a complete 360-degree field-of-view surveillance system. We build an original device that combines a stereo-catadioptric camera and a pan/tilt stereo-perspective camera, and apply it in a real environment. Compared to previous systems, we show the benefits of the two proposed systems in terms of maintaining both high speed and high resolution with collaborative moving cameras and having an enormous search space with the hybrid configuration. Experimental results are provided to show the effectiveness of mixing collaborative and hybrid systems.
One of the requirements for off-road autonomous vehicles is to move stably in unstructured environments. This capability is one of the most important in terms of mobility. Therefore, many researchers use contact and/or non-contact methods to determine whether a vehicle can traverse a given terrain. In this paper, we introduce an algorithm to classify terrains using visual information (one of the non-contact methods). As a pre-processing step, a contrast enhancement technique is introduced to improve terrain classification. For classification, training images are grouped according to surface material, and Bayesian classification is then applied to new images to determine their membership in each group. In addition to the classification, we build a traversability map specified by friction coefficients, from which autonomous vehicles can decide whether or not to proceed. Experiments are conducted with a load cell to determine the real friction coefficients of various terrains.
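The grouping-then-classification step can be illustrated with a minimal Gaussian naive Bayes sketch. It assumes each image has already been reduced to a small numeric feature vector and that the features are modeled as independent Gaussians per terrain group. The abstract does not specify the features or the likelihood model, so the labels, feature values, and function names below are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_bayes(samples):
    """Fit per-class priors, means, and variances.

    samples: list of (feature_vector, label) pairs.
    """
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    total = len(samples)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / n + 1e-6
            for d in range(dims)
        ]
        model[label] = (n / total, means, variances)
    return model


def classify(model, features):
    """Return the label with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-dimensional features for two surface materials.
training = [
    ((0.20, 0.80), "grass"), ((0.25, 0.75), "grass"), ((0.30, 0.85), "grass"),
    ((0.70, 0.30), "gravel"), ((0.75, 0.35), "gravel"), ((0.80, 0.25), "gravel"),
]
model = fit_gaussian_bayes(training)
label = classify(model, (0.22, 0.80))
```

Working in log space avoids numerical underflow when many feature dimensions are multiplied together.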
Recently, many vision-based navigation methods have been introduced as intelligent robot applications. However, many of these methods mainly focus on finding an image in the database corresponding to a query image. Thus, if the environment changes, for example when objects move in the environment, a robot is unlikely to find consistent corresponding points with any of the database images. To solve this problem, we propose a novel navigation strategy that uses fast motion estimation and a practical scene recognition scheme to handle the kidnapping problem, which is defined as the problem of re-localizing a mobile robot after it has undergone an unknown motion or visual occlusion. This algorithm is based on camera-based motion estimation to plan the next movement of the robot and an efficient outlier rejection algorithm for scene recognition. Experimental results demonstrate the capability of vision-based autonomous navigation in dynamic environments.
This paper presents a new sensor system, CALOS, for motion estimation and 3D reconstruction. The 2D laser sensor provides accurate depth information of a plane, not the whole 3D structure. In contrast, the CCD cameras provide the projected image of the whole 3D scene, not the depth of the scene. To overcome these limitations, we combine the two types of sensors, the laser sensor and the CCD cameras. We develop a motion estimation scheme appropriate for this sensor system. In the proposed scheme, the motion between two frames is estimated by using three points among the scan data and their corresponding image points, and refined by non-linear optimization. We validate the accuracy of the proposed method by 3D reconstruction using real images. The results show that the proposed system can be a practical solution for motion estimation as well as for 3D reconstruction.
In this paper, we present a practical place and object recognition method for guiding visitors in building environments. Recognizing places or objects in the real world can be a difficult problem due to motion blur and camera noise. In this work, we present a modeling method based on the bidirectional interaction between places and objects for simultaneous reinforcement toward robust recognition. The unification of visual context including scene context, object context, and temporal context is also. The proposed system has been tested to guide visitors in a large-scale building environment (10 topological places, 80 3D objects).
In this paper, we introduce visual contexts in terms of types and utilization methods for robust object recognition with intelligent mobile robots. One of the core technologies for intelligent robots is visual object recognition. Robust techniques are strongly required since there are many sources of visual variation, such as geometric variation, photometric variation, and noise. To meet these requirements, we define spatial context, hierarchical context, and temporal context. These visual contexts can be selected according to the object recognition domain. We also propose a unified framework that can utilize all of the contexts and validate it in a real working environment. Finally, we discuss future research directions of object recognition technologies for intelligent robots.