Directions for Improving the Vocabulary Matching Tool and Data of the Institute for Han-Character Education Research for Tokenizing Classical Chinese Texts

Original Korean title: 한문 고전 토크나이징을 위한 한문교육연구소 어휘 매칭 툴과 데이터 개선 방향

Yunsoo Shin; Dami Kim; Solip Choi

KCI candidate-listed journal (KCI 등재후보)

Language: Korean
URL: https://db.koreascholar.com/Article/Detail/432462


Journal of Applied Studies on Singograph and Literary Sinitic

Vol. 2 (2023.12)
pp.101-129

Institute for Han-Character Education Research, Dankook University (Han-character Education Research Center)

Abstract

This paper runs the specialized vocabulary matching tool under development at the Institute for Han-Character Education Research, affiliated with Dankook University, in order to discuss problems in its vocabulary data and directions for improvement. The tool is ultimately intended for tokenizing classical Chinese texts, and matching special vocabulary can be regarded as the first step toward tokenizing an entire text. The tool's output is compared with the automatic markup produced by MARKUS, the strengths and weaknesses of the tool and its data are analyzed, and ways to address the problems found in the process are proposed. Because it is specialized for classical Chinese texts, the tool is expected to play an important role and to contribute to tokenizing such texts in the future. In its current state, however, it needs several improvements. First, data on Korean place names and personal names should be added: the current data focuses mainly on Chinese vocabulary, so Korean-specific vocabulary is lacking, a gap that could be closed by building additional vocabulary data. Second, issues such as the matching of aliases need to be resolved.
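To make the idea of "matching special vocabulary as a first step toward tokenizing" concrete, the following is a minimal sketch of dictionary-based longest-match lookup. It is not the Institute's tool, whose implementation this page does not describe; the lexicon entries, the function name `match_vocabulary`, and the alias-to-canonical mapping are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> (category, canonical form).
# Real vocabulary data would be far larger; these entries are illustrative only.
LEXICON: Dict[str, Tuple[str, str]] = {
    "蘇軾": ("person", "蘇軾"),
    "東坡": ("person", "蘇軾"),  # an alias mapped to its canonical name
    "黃州": ("place", "黃州"),
    "漢陽": ("place", "漢陽"),  # stands in for a Korean place-name entry
}

Span = Tuple[int, int, str, str, str]  # (start, end, surface, category, canonical)


def match_vocabulary(text: str, lexicon: Dict[str, Tuple[str, str]]) -> List[Span]:
    """Scan left to right, taking the longest lexicon entry at each position."""
    max_len = max(map(len, lexicon), default=0)
    spans: List[Span] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so longer entries win over shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in lexicon:
                category, canonical = lexicon[candidate]
                spans.append((i, i + n, candidate, category, canonical))
                i += n
                break
        else:
            # No entry starts here; leave the character for later tokenizing steps.
            i += 1
    return spans


if __name__ == "__main__":
    for span in match_vocabulary("東坡謫黃州", LEXICON):
        print(span)
```

In this sketch an alias resolves to a canonical form only if it is listed in the lexicon, and a name missing from the lexicon is simply never matched. Those are the two kinds of gap the abstract points to: alias matching and missing Korean-specific entries.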

Keywords

Vocabulary data; Tagging; Matching; Markup; Digital humanities

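Ⅰ. Introduction
Ⅱ. The Institute for Han-Character Education Research Vocabulary Matching Tool and Vocabulary Data
    1. The vocabulary matching tool
    2. The Institute's vocabulary data
Ⅲ. Running the Matching Tool and Comparison with MARKUS Automatic Markup
    1. General prose
    2. Historical documents
    3. Gwangak-mun (관각문)
Ⅳ. Conclusion: Problems and Directions for Improvement
    1. Institute for the Translation of Korean Classics, <고전용어시소러스> (Thesaurus of Classical Terms)
    2. Korea University, <유서 어휘 목록> (Yuseo Vocabulary List)
    3. Academy of Korean Studies, <조선조 관직 정보> (Information on Joseon Dynasty Official Posts)
    4. Academy of Korean Studies, <장서각 소장 의궤 수록 복식 사전> (Dictionary of Costume Recorded in the Uigwe Held at Jangseogak)
References (參考文獻)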

Authors

Yunsoo Shin (신윤수), Researcher, Institute for Han-Character Education Research, Dankook University
Dami Kim (김다미), Researcher, Institute for Han-Character Education Research, Dankook University
Solip Choi (최솔잎), Researcher, Institute for Han-Character Education Research, Dankook University
