A Voice-Driven Compound Emotion Digital Human Interaction System

Lei Fang; Fan Yang; Mincheol Whang


Indexing: KCI listed

Language: English
URL: https://db.koreascholar.com/Article/Detail/450454


Korean Journal of the Science of Emotion & Sensibility (감성과학)

Vol. 29, No. 1 (March 2026)
pp. 43-54

The Korean Society for Emotion & Sensibility (한국감성과학회)

Abstract

To enhance the naturalness and emotional resonance of virtual characters in real-time human–computer dialogue, this study proposes a speech-driven framework for compound-emotion digital human interaction. The system first uses a speech emotion recognition (SER) module to extract affective features from the user's voice, then performs a fine-grained compound emotion weight analysis with a GPT-based model. The results are structured as JSON and transmitted via a local API to the Unreal Engine 5 (UE5) rendering environment, where speech parameters are dynamically mapped to MetaHuman facial action units (AUs). To evaluate the system's effectiveness, 40 participants rated four perceptual dimensions: naturalness and realism, effectiveness of compound emotion expression, emotional resonance, and overall interaction performance. The mean scores on all four dimensions were significantly higher than the neutral level of 3 points (p < 0.001), and Cronbach's α exceeded 0.70, indicating good internal consistency. Large effect sizes (Cohen's d > 0.8) further indicate that the system has clear advantages in emotional expressiveness and interaction fluency. Overall, the framework achieves cross-modal emotional transmission through speech-driven compound emotion generation and provides an extensible technical pathway for future research in affective computing and digital human interaction.
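
The hand-off described in the abstract (compound emotion weights, structured as JSON, then mapped to facial action units) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the emotion labels, the `EMOTION_TO_AUS` table (loosely based on common FACS prototypes), the payload field names, and the linear blending rule are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical AU prototypes per basic emotion (illustrative only,
# not the mapping used by the authors).
EMOTION_TO_AUS = {
    "happiness": {"AU6": 1.0, "AU12": 1.0},
    "sadness": {"AU1": 1.0, "AU4": 1.0, "AU15": 1.0},
    "surprise": {"AU1": 1.0, "AU2": 1.0, "AU5": 1.0, "AU26": 1.0},
    "anger": {"AU4": 1.0, "AU5": 1.0, "AU7": 1.0, "AU23": 1.0},
}


def normalize_weights(weights):
    """Scale non-negative emotion weights so that they sum to 1."""
    if any(value < 0 for value in weights.values()):
        raise ValueError("emotion weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    return {name: value / total for name, value in weights.items()}


def blend_action_units(weights):
    """Weighted sum of per-emotion AU prototypes, clamped to [0, 1]."""
    blended = {}
    for emotion, weight in normalize_weights(weights).items():
        for au, intensity in EMOTION_TO_AUS[emotion].items():
            blended[au] = blended.get(au, 0.0) + weight * intensity
    return {au: round(min(1.0, value), 3) for au, value in sorted(blended.items())}


def build_payload(weights):
    """Serialize the emotion weights and blended AU intensities as JSON."""
    return json.dumps(
        {
            "emotions": normalize_weights(weights),
            "action_units": blend_action_units(weights),
        }
    )


if __name__ == "__main__":
    # A compound emotion: mostly happiness with some surprise.
    print(build_payload({"happiness": 0.6, "surprise": 0.4}))
```

In a setup like the one the abstract describes, a JSON string of this kind would be sent to a local endpoint that the rendering side reads. The transport and the way the receiving side applies AU intensities to the character are not specified on this page, so they are left out of the sketch.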

Keywords

Affective Engineering; Affective Computing; Experimental Design; VR; Human Factors Engineering

Abstract
Abstract (Korean)
1. Introduction
2. Related Work
    2.1. Speech Emotion Recognition (SER)
    2.2. Digital Human Facial Expression Generation
    2.3. Multimodal Fusion of Emotion, Expression, and Language
    2.4. Comparison and Contributions
3. Method
    3.1. Overall Framework Overview
    3.2. Core Framework
4. Results
    4.1. Data and Preprocessing
    4.2. Questionnaire Analysis Results: Reliability Analysis
    4.3. Descriptive Statistics and Normality
    4.4. Evaluation: One-Sample t-Test
    4.5. Group Comparison
    4.6. Discussion
5. Conclusion
Acknowledgements
REFERENCES

Authors

Lei Fang (Ph.D. student, Department of Emotion Engineering, Sangmyung University)
Fan Yang (M.S. student, Department of Emotion Engineering, Sangmyung University)
Mincheol Whang (Associate professor, Department of Human-Centered Artificial Intelligence, Sangmyung University), corresponding author
