A Quantitative and Qualitative Evaluation of Large Language Model-Based Scholarly Topic Description Generation: The Case of OpenAlex
This study develops and evaluates a generative AI-based system that automatically generates scholarly topic descriptions for the OpenAlex database. Although OpenAlex provides concise topic descriptions, they lack contextual richness and informational coverage, which limits researchers’ ability to quickly grasp the semantic relevance of each topic. To address this issue, new descriptions were generated for a total of 4,516 topics from four metadata attributes (topic_id, topic_name, description, and keywords) and compared with the original descriptions. Multiple large language models (LLMs), including GPT, LLaMA, and Mistral, were employed, and a consistent prompt-engineering scheme was designed so that model comparisons are reproducible. A standardized evaluation framework integrating quantitative and qualitative indicators is proposed. The quantitative evaluation comprised keyword-based Precision, Recall, and F1 scores, ROUGE-L, Specter2 embedding-based cosine similarity, and BERTScore. The qualitative evaluation used LLM-based pairwise comparison on three criteria (Relevance, Coverage, and Clarity), with relative rankings determined through the Elo rating system. The Friedman test and the Wilcoxon signed-rank test were then applied to test the statistical significance of the observed differences. The experimental results revealed distinctive strengths and weaknesses across models, providing a benchmarking foundation for improving automated content generation in scholarly databases such as OpenAlex. The proposed evaluation framework also offers a reproducible and consistent basis for assessing generative models, contributing to both academic research and practical system development.
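As an illustration of how pairwise judgments can be turned into a relative ranking, the following is a minimal sketch of the standard Elo update. It is not the study's implementation: the abstract does not state the K-factor, the initial rating, or how ties were handled, so `k=32.0`, `initial=1000.0`, the tie score of 0.5, and the model labels in the usage note are all assumptions made for the example.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(comparisons, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Apply sequential Elo updates to a list of pairwise outcomes.

    comparisons: iterable of (model_a, model_b, score_a), where score_a is
    1.0 if A's description was preferred, 0.0 if B's was, 0.5 for a tie.
    k and initial are assumed values, not taken from the study.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score_a in comparisons:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # The two updates are symmetric, so total rating points are conserved.
        ratings[model_a] = ra + k * (score_a - ea)
        ratings[model_b] = rb - k * (score_a - ea)
    return dict(ratings)
```

For example, `elo_ratings([("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)])` returns one rating per hypothetical model label. Because the updates are sequential, the result depends on the order of the comparisons; shuffling the comparison list and averaging over several runs is one common way to reduce that sensitivity.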