The Retrieval-Augmented Generation (RAG) framework has gained attention as a fast and cost-effective way to improve the performance of large language models (LLMs). However, its performance remains limited in less-resourced languages such as Korean, and the problem is more severe in specialized fields such as construction. To address these limitations, this study proposes a dataset construction method that enables low-cost fine-tuning of embedding models originally trained on English-based data. Applying this method to the construction domain, we achieved a top-1 document retrieval accuracy of 58.65%, surpassing a commercial embedding model provided by OpenAI. We further analyze how improvements in the embedding model influence the overall RAG pipeline, and we present both a dataset creation approach and an appropriate evaluation strategy for testing RAG performance. Our findings suggest that this method can significantly enhance technical efficiency by providing a foundation for users of diverse languages to make effective use of RAG in the construction domain.
Details
1 School of Civil and Environmental Engineering, Yonsei University
2 Department of Building, Civil, and Environmental Engineering, Concordia University