1. Introduction
With the continuous development of information technology, researchers have begun to use the interdisciplinary methods of digital humanities (DH) to open up a new paradigm for humanities research. Consequently, a great deal of DH data has accumulated, including subject databases, electronic archives, knowledge bases and webpages. These multi-source heterogeneous data are difficult for computers to read and understand, which increases the difficulty and workload of DH research. It is therefore necessary to extract machine-understandable knowledge from the data and organize it into a knowledge graph to support DH research.
Tang poetry is representative of traditional Chinese literature and one of the highest achievements of Chinese poetry. More than 50,000 poems written by over 2,200 poets of the Tang dynasty survive, and they have had a far-reaching influence on Chinese culture and even on world culture. At present, a large number of experts in China conduct research on Tang poetry and have achieved fruitful results (Li, 2010).
As one of the important fields of DH, Tang poetry has accumulated a large amount of data resources. However, these resources are scattered, sparse and poorly organized. In the field of Tang poetry, knowledge extraction offers a way to transform multi-source heterogeneous data into “intelligent” linked data, i.e. entities and the relationships between them. This, in turn, provides a solid foundation for knowledge association and reasoning, which support DH studies and intelligent applications.
In this paper, we study the problem of extracting knowledge from unstructured texts of Tang poetry. This is not an easy task, owing to the large scale of the data and the unique characteristics of Tang poetry. Firstly, Tang poetry is a type of ancient Chinese text with distinctive vocabulary, sentence patterns, grammar and rhyme schemes. Secondly, state-of-the-art knowledge extraction techniques such as machine learning and deep learning suffer from a lack of training instances and prior knowledge (Alani et al., 2003), and thus cannot be directly applied to such humanities research. Last but not least, knowledge extraction that relies on domain experts is costly and does not scale to large amounts of data. Although some studies (Plaisant, 2006) have combined the efforts of domain experts and computer systems in the DH field, they still cannot...





