Quantitative and Qualitative Evaluation of Scholarly Topic Description Generation Using Large Language Models: A Case Study on OpenAlex

Sanggook Kim; Hyuk Hahn; Taehoon Kwon


Indexing: KCI listed

Korean title: 대규모 언어모델 기반 학술 토픽 설명 생성의 정량․정성 평가 연구: OpenAlex 사례


Language: Korean
URL: https://db.koreascholar.com/Article/Detail/449278


Journal of Society of Korea Industrial and Systems Engineering (한국산업경영시스템학회지)

Vol.48 No.4 (2025.12)
pp.79-94

Society of Korea Industrial and Systems Engineering (한국산업경영시스템학회)

Abstract

This study develops a generative AI-based system for automatically generating scholarly topic descriptions within the OpenAlex database and evaluates its performance. Although OpenAlex provides concise topic descriptions, they lack contextual richness and informational coverage, limiting researchers’ ability to quickly grasp the semantic relevance of each topic. To address this issue, this study generated new descriptions for a total of 4,516 topics by utilizing metadata attributes—topic_id, topic_name, description, and keywords—and compared them with the original descriptions. Multiple large language models (LLMs), including GPT, LLaMA, and Mistral, were employed, and a consistent prompt-engineering scheme was designed to ensure the reproducibility of model comparison. A standardized evaluation framework integrating quantitative and qualitative indicators was proposed. Quantitative evaluation included keyword-based Precision, Recall, and F1 scores, ROUGE-L, Specter2 embedding-based cosine similarity, and BERTScore. Qualitative evaluation was conducted using LLM-based pairwise comparison, assessing Relevance, Coverage, and Clarity, with relative rankings determined through the Elo rating system. Furthermore, the Friedman test and Wilcoxon signed-rank test were applied to verify statistical significance. Experimental results revealed distinctive strengths and weaknesses across models, providing a benchmarking foundation for improving automated content generation in scholarly databases such as OpenAlex. The proposed evaluation framework also offers a reproducible and consistent basis for assessing various generative models, contributing to both academic research and practical system development.
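
Two of the indicators named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how the keyword-based scores or the Elo parameters were defined, so everything specific here is an assumption made for illustration: the token-set overlap definition of Precision and Recall, the K-factor of 32, the 400-point scale, and the function names `keyword_prf` and `elo_update`. It is not the authors' implementation.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_prf(description: str, keywords: list[str]) -> tuple[float, float, float]:
    """Token-overlap Precision, Recall and F1 between a description and topic keywords.

    Assumed definition (not taken from the paper):
      precision = shared tokens / tokens in the description
      recall    = shared tokens / tokens in the keyword list
    """
    desc = _tokens(description)
    kw = _tokens(" ".join(keywords))
    if not desc or not kw:
        return 0.0, 0.0, 0.0
    shared = len(desc & kw)
    precision = shared / len(desc)
    recall = shared / len(kw)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise comparison.

    score_a is 1.0 if A's description is preferred, 0.0 if B's is, 0.5 for a tie.
    The K-factor of 32 is an assumed value.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical description and keywords, not taken from OpenAlex.
    print(keyword_prf("Graph neural networks for molecules",
                      ["graph neural networks", "drug discovery"]))
    # Two models start at an assumed rating of 1000; model A wins one comparison.
    print(elo_update(1000.0, 1000.0, 1.0))
```

In a pairwise setup like the one the abstract describes, `elo_update` would be applied once per judged pair, and the final ratings give the relative ranking of the models.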

Keywords

Large Language Models (LLMs); Topic Description Generation; OpenAlex; Scholarly Database Evaluation Benchmarking Framework

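1. Introduction
2. Related Work
    2.1 Research Trends in Automating Scholarly Topic Descriptions
    2.2 Research Trends in Scholarly Text Generation and Evaluation Using Large Language Models
    2.3 Research on Evaluating Text Generation Quality
3. Research Methodology
    3.1 Research Overview
    3.2 Research Data Composition
    3.3 Description Generation Model Configuration
    3.4 Prompt Engineering Design
    3.5 Evaluation Framework Design
    3.6 Statistical Significance Testing
4. Experiments and Results
    4.1 Experimental Overview
    4.2 Quantitative Evaluation Results
    4.3 Qualitative Evaluation Results
    4.4 Statistical Test Results
    4.5 Overall Analysis and Implications
5. Conclusion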
Acknowledgement
References

Authors

Sanggook Kim (김상국), Global R&D Analysis Center, Korea Institute of Science and Technology Information
Hyuk Hahn (한혁), Global R&D Analysis Center, Korea Institute of Science and Technology Information
Taehoon Kwon (권태훈), Global R&D Analysis Center, Korea Institute of Science and Technology Information
