Purpose - A grid information retrieval model has benefits for sharing resources and processing mass information, but cannot handle conceptual heterogeneity without integration of semantic information. The purpose of this research is to propose a concept-based retrieval mechanism to catch the user's query intentions in a grid environment. This research re-ranks documents over distributed data sources and evaluates performance based on user judgment and processing time.
Design/methodology/approach - This research uses the ontology lookup service to build the concept set in the ontology and captures the user's query intentions as a means of query expansion for searching. The Globus toolkit is used to implement the grid service. A modification of the collection retrieval inference (CORI) algorithm is used for re-ranking documents over distributed data sources.
Findings - The experiments demonstrate that the proposed approach successfully describes the user's query intentions, as evaluated by user judgment. In terms of processing time, building a grid information retrieval model is a suitable strategy for the ontology-based retrieval model.
Originality/value - Most current semantic grid models focus on construction of the semantic grid, and do not consider re-ranking search results from distributed data sources. The significance of evaluation from the user's viewpoint is also ignored. This research proposes a method that captures the user's query intentions and re-ranks documents in a grid based on the CORI algorithm. The proposed ontology-based retrieval mechanism calculates the global relevance score of all documents in a grid and displays those documents with higher relevance to users.
Introduction
In our information society the internet has become the greatest human knowledge base ([30] Segev et al., 2007). The volume of information published on the internet is growing rapidly, as is the number of users accessing the web. Many electronic libraries play a key role in making their information repositories easily and efficiently accessible via the internet.
However the performance of a centralised storage database for an electronic library may decrease as the size of the document collection and the number of users increase ([2] Basirat and Khan, 2010). In general different electronic libraries define different search criteria to access their own document collections. In addition users have to learn a new search model for each online electronic library. Thus, in order to access several electronic libraries simultaneously, it is necessary to integrate the distributed electronic libraries via a federated service that provides a unified interface for users ([41] Trnkoczy et al., 2006).
In recent times grid and cloud technologies have provided a novel means for the sharing and coordinated use of diverse resources in dynamic, distributed and virtual organisations ([38] Stanoevska-Slabeva et al. , 2010). Even though there is no consensus on the difference between them, grid computing is treated as the foundation of cloud computing ([13] Foster et al. , 2008). Grid computing can support the integration of all distributed heterogeneous online databases in one platform and facilitate the discovery process easily and efficiently ([12] Foster et al. , 2001). However the current search mechanisms for use on the grid or cloud are mostly based on keyword matching ([45] Vega-Gorgojo et al. , 2005). The traditional keyword-based search mechanism retrieves documents with less relevance to the queries ([15] Haav and Lubi, 2001; [34] Sieg et al. , 2007). The main issue is that a keyword-based search model treats a term or a word only as a sequence of binary codes. It does not take into account the concept of the term and the relationship between terms as a cue for searching. Thus it involves word form mapping between queries and data resources in general. Even if some linguistic techniques, such as word stemming and removal of stop words are used, this information retrieval model is still based on a keyword searching mechanism ([15] Haav and Lubi, 2001).
A grid cannot deal with conceptual heterogeneity without integration of semantic information ([50] Zheng et al. , 2011). The semantic capability could be provided by the ontology, which is treated as a set of logical axioms designed to summarise the concept ([27] Ranwez et al. , 2010). That is, the ontology conceptualises the represented knowledge based on a set of concepts and relationships between concepts, which are terms for describing knowledge ([43] van der Vet, 1998). Although the ontology is able to provide semantic concepts, it has limitations regarding resource sharing and mass information processing ([50] Zheng et al. , 2011). The semantic grid ([9] de Roure et al. , 2005) is an extension of the current grid technique in which information and services are well defined and smoothly integrated. In recent work several researchers have focussed on the construction of the semantic grid, but the significance of evaluation from the viewpoint of users is ignored ([21] Liu and Xiao, 2009; [49] Zhang and Yin, 2010; [48] Zhang and Ting, 2010; [50] Zheng et al. , 2011).
In this research an ontology-based grid information retrieval (OGIR) framework is proposed to extract relevant documents from several distributed document collections based on concept matching, so that users are efficiently provided with documents that are more relevant to their area of interest. In particular the research focuses on the retrieval of biomedical documents, i.e. the BioMed Central corpus (see www.biomedcentral.com), to assess the retrieval performance of OGIR. As measuring the ability to meet the user's information needs is important for an information retrieval model ([37] Spink and Wilson, 1999; [36] Spink, 2002; [42] Tsai et al., 2006), the judgment of 16 doctors and the processing time measure are used.
The remainder of this paper is organised as follows. The following section briefly describes the information retrieval models and gives an overview of related research work using ontology and grid techniques in the field of information retrieval. The next section describes the detailed architecture of this proposed OGIR mechanism in the grid environment. Four experimental scenarios and the designs for the experiments are subsequently described. Then the results of the experiments and evaluation approaches are presented. Finally a conclusion and some possible future research areas are given.
Related work
An ontology is a formal, explicit specification of a shared conceptualisation, which contains a rich set of concepts and their interrelationships ([39] Studer et al. , 1998). Many researchers have used ontologies in the field of information retrieval in order to build concept-based retrieval mechanisms (e.g. [10] Ferrara et al. , 2006; [6] Corby et al. , 2006; [26] Nguyen and Phan, 2009; [18] Lim et al. , 2009). Semantic portals provide simple search functions that may be better characterised as semantic data retrieval, rather than semantic information retrieval ([5] Castells et al. , 2004). They return only ontology instances rather than documents, and results are not ranked.
[28] Rinaldi (2009) used a general domain ontology, i.e. WordNet ([25] Miller, 1995), to build the dynamic semantic network information retrieval model. This integrates subject-keywords with domain-keywords to provide greater performance, but the specific sense in WordNet has to be chosen by the user. [23] Martins et al. (2008) used a domain-specific ontology for information retrieval on a ubiquitous medical learning environment. They employed an implicit relevance feedback approach using contextual information to build semantic indexes. [46] Wenjie et al. (2004) proposed an ontology-based architecture for an intelligent information retrieval system, which constructs a hierarchical tree model of a domain dependent ontological repository. [32] Shah et al. (2002) retrieved documents consisting of both free text and semantic annotations. Their proposed framework is capable of dealing with the interdependency of the search. However none of this research considers the significance of the ranking method for retrieved documents. [24] Mayfield and Finin (2003) combined the ontology and text-based information retrieval in sequence and in a cyclical way. The relationship of terms in the hierarchy is used for query expansion. Based on their work this research treats the semantic search as complementary to the keyword-based search, provided that a specific domain ontology is available.
Recent advances in information technology have strengthened the significance of information retrieval from distributed and other types of databases. [31] Seo and Croft (2008) observed that a blog site search is similar to resource selection in distributed information retrieval. Grid computing is a novel technique, which provides not only enormous computational power but also massive virtual data storage for a collection of heterogeneous distributed resources ([17] Johanstona, 2002; [11] Foster and Kesselman, 2004). In other words the grid is able to federate heterogeneous resources, which may be located within different organisations, and provide transparent, secure, and high-performance access to such distributed and various resources. [33] Shi et al. (2002) proposed three grid topologies (Figure 1 [Figure omitted. See Article Image.]) - i.e. flat, hierarchical, and mesh - for the information grid, and four evaluation measures - i.e. scalability, availability, maintainability, and information retrieval performance - for these topologies. They found that the flat topology has high availability and maintainability but poor scalability and information retrieval performance. The hierarchical topology is much more scalable than the flat topology, but its information retrieval performance is also low. The mesh topology has high availability, scalability and information retrieval performance, but poor maintainability.
[44] Vateekul and Rungsawang (2004) proposed a distributed information retrieval prototype, named DWORM, on a grid environment. The DWORM project distributes its work to each computing node to handle web data in its local zone and it improves the grid environment through the geographic dispersion of web services. However most of these systems have adopted the exact matching mode for resource discovery, but have not employed the semantic search.
Recent research has focussed on further analysing and understanding search keywords (e.g. [40] Sun et al. , 2005, [1] Aloisio et al. , 2005). With the hierarchical representation of keywords in the search results, users can easily trace and identify the relevant concepts for the specific topic. An ontology could help localise the right concept to be searched for as opposed to identification of a mere label naming a search table. An ontology includes definitions of basic concepts in the domain, and the semantic relationships between them, which should be interpretable both by machines and humans. For example [40] Sun et al. (2005) proposed a prototype grid system named RSIsGrid for semantic-based remote sensing image retrieval using ontology and grid technologies. The research aims at constructing the grid as a tool to store semantic instances of the remote sensing images and implement remote sensing image data retrieval by knowledge inference through ontology. [1] Aloisio et al. (2005) proposed a semantic data access and integration service, based on the grid paradigm, for the bioinformatics domain. The proposed service uses ontology for correlating different data sets.
The notion of the semantic grid was proposed by [8] de Roure et al. (2001). The semantic grid is an extended grid in which information and services are given well-defined concepts so that computers and people work in cooperation. Several researchers have recently focused on construction of the semantic grid (e.g. [21] Liu and Xiao, 2009; [49] Zhang and Yin, 2010; [48] Zhang and Ting, 2010; [50] Zheng et al., 2011). All of these research approaches provide semantic-based information retrieval on the grid through ontology. However, in these approaches the ontology serves only as a reference for query inference, and the significance of evaluation from the viewpoint of the users is ignored.
This research proposes a semantic-based data discovery algorithm based on the structure of ontology in the grid environment. Through the hierarchical structure of ontology, the user's query intentions are described and all search results are re-ranked in accordance with the user's interests.
The proposed OGIR framework
The aim of this research is to provide a concept-based information retrieval framework in the grid environment. For this purpose a novel ontology-based grid information retrieval framework (OGIR) is proposed. This framework can be divided into three modules: the concept extraction module, the grid information retrieval module and the document ranking module. Figure 2 [Figure omitted. See Article Image.] shows the overview of this proposed framework. Initially the concept extraction module extracts the user's query intentions. By mapping the user's queries to the specific domain ontology, the concept extraction module generates a user concept tree (UCT) and forms a relevant concept set. This concept set is treated as query expansion in the field of information retrieval. Next the grid information retrieval module retrieves the relevant information from several distributed data sources. Finally the document ranking module calculates the global similarities between the concept set and all documents returned from those distributed data sources.
Extracting concepts from the ontology
In order to extract the concept set of the user's query intentions, the user concept tree (UCT) is defined, which is part of a specific ontology. The constituent components of the UCT are the base concept, the derived concept, the concept linking path and the concept level. The base concept is the original query term found in the ontology hierarchy. The derived concept is the relevant concept derived from the base concept in the ontology hierarchy. These two concepts form the user concept tree. The concept linking path is the semantic connection between two concepts. The concept level is the parameter for determining the scope of the user concept tree. The user concept tree provides users with more specific concepts derived from some levels of the base concept in the ontology hierarchy. For example, Figure 3 [Figure omitted. See Article Image.] shows part of the hierarchical structure of a medical ontology. The query term "gastric disease" is found in this ontology. Thus "gastric disease" is the base concept and its one-level derived concepts are "gastritis" and "gastric cancer". The base concept and its one-level derived concepts form a one-level UCT. Thus the user concept tree contains "gastric disease", "gastritis" and "gastric cancer", which are treated as the user's query intentions.
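The construction of a user concept tree can be sketched as a bounded breadth-first walk over the ontology hierarchy. This is a minimal illustration, not the authors' implementation: the ontology is a hypothetical in-memory parent-to-children mapping, and every concept other than "gastric disease", "gastritis" and "gastric cancer" is invented for the example. The names `ONTOLOGY` and `build_uct` are assumptions.

```python
from collections import deque

# Hypothetical toy fragment of a medical ontology (parent -> children).
# Only "gastric disease", "gastritis" and "gastric cancer" come from the text;
# the other concepts are invented for illustration.
ONTOLOGY = {
    "digestive system disease": ["gastric disease", "intestinal disease"],
    "gastric disease": ["gastritis", "gastric cancer"],
    "gastritis": ["atrophic gastritis"],
}


def build_uct(base_concept, ontology, concept_level):
    """Return {concept: level} for the base concept and its derived concepts.

    Level 0 is the base concept; derived concepts are collected down to
    `concept_level` levels below it. An empty dict means the query term
    was not found in the ontology hierarchy.
    """
    known = set(ontology) | {c for kids in ontology.values() for c in kids}
    if base_concept not in known:
        return {}

    uct = {base_concept: 0}
    queue = deque([base_concept])
    while queue:
        concept = queue.popleft()
        level = uct[concept]
        if level == concept_level:
            continue  # do not descend below the requested concept level
        for child in ontology.get(concept, []):
            if child not in uct:
                uct[child] = level + 1
                queue.append(child)
    return uct


uct = build_uct("gastric disease", ONTOLOGY, concept_level=1)
print(uct)  # {'gastric disease': 0, 'gastritis': 1, 'gastric cancer': 1}
```

With the concept level set to 1, the walk stops at the direct descendants of the base concept, which matches the one-level UCT described in the example above.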
Information retrieval in a grid
In order to apply the ontology to the concept-based grid information retrieval model, this research uses the ontology lookup service (OLS) ([7] Côté et al. , 2006) to extend a user's query to the user concept tree. The ontology lookup service provides a centralised and federated query interface for the ontology.
According to the concept set from the UCT, the concept-based information can be retrieved from distributed data sources. This research proposes four basic experimental scenarios to analyse the results and uses the Globus Toolkit 4 package (see www.globus.org/) to implement the grid service of all sites. The Lucene API (see http://lucene.apache.org/) was used to implement the indexing function of all sites as data sources and was combined with the grid service to provide a remote searching function. Note that in this research 17,808 articles from the BioMed Central corpus (see www.biomedcentral.com/) are the data source for experiments.
Ranking search results
For the purpose of evaluating the relevance between the retrieved documents and the user's query intentions, a weight is calculated for every concept in the user concept tree. This research gives different weights to concepts at different levels from the base concept, based on the principle that the relevance of a derived concept is inversely related to its distance from the base concept. Thus the weight of terms in the user concept tree is defined as shown in equation (1): Equation 1 [Figure omitted. See Article Image.] where a indicates a term index, α is a constant (0 < α < 1), l is the number of concept levels, and the weight of terms in the base concept is 1. In this research α is set at 0.5 and l is set at 1 for all experiments.
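Since equation (1) is not reproduced here, the following sketch assumes one plausible form, W = α raised to the concept's level below the base concept. This assumption is consistent with the stated constraints (base concept weight of 1, and a weight of 0.5 for one-level derived concepts when α = 0.5), but the exact published formula may differ. The function name `concept_weights` is hypothetical.

```python
def concept_weights(concept_levels, alpha=0.5):
    """Map each concept to a weight that decays with its level.

    ASSUMPTION: weight = alpha ** level. This is one form that satisfies the
    constraints stated in the text; it is not taken from equation (1) itself.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return {concept: alpha ** level for concept, level in concept_levels.items()}


# One-level user concept tree from the "gastric disease" example.
levels = {"gastric disease": 0, "gastritis": 1, "gastric cancer": 1}
weights = concept_weights(levels, alpha=0.5)
print(weights)  # {'gastric disease': 1.0, 'gastritis': 0.5, 'gastric cancer': 0.5}
```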
The proposed model retrieves relevant documents from different data sources, which may have different relevance for the user's query intentions. Thus it is necessary to integrate the search results from different data sources and rank all documents based on the similarity between each document and the user's query intentions under a global environment. This procedure consists of three phases as follows.
The calculation of relevance based on concept frequency
Assume that there are m documents in the data source R_j that include at least one term from the user's query intentions. The term frequency (TF) in a document provides a useful measurement of word significance ([22] Luhn, 1957). That is, a word with a greater frequency has a stronger relationship with its associated document. Other variations are possible and their performance depends on the data set (e.g. [35] Sparck Jones, 1972; [29] Salton and Buckley, 1988; [47] Yang and Chute, 1994; [19] Lin, 1996). This research uses the term frequency technique, which is one of the well-known vector space models, as the document vector representation approach. Thus terms in the user's query intentions are treated as the control vocabulary. Table I [Figure omitted. See Article Image.] shows the frequency of concept C_n, where TF_{n,m} is the frequency of concept C_n in document m.
This research represents a document through the vector space model as shown in equation (2) and thus the relevance score of document d_{j,m} in the data source R_j can be computed as shown in equation (3): Equation 2 [Figure omitted. See Article Image.] Equation 3 [Figure omitted. See Article Image.] where W_a is the weight of concept C_a in the concept set and TF_{a,m} is the frequency of concept C_a in document d_{j,m}. Then the total relevance of all documents containing concept C_a in the data source R_j is obtained as shown in equation (4): Equation 4 [Figure omitted. See Article Image.] where m is the total number of documents containing concept C_a.
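The per-document and per-source scores can be sketched as follows, reading equation (3) as a weighted sum of concept frequencies and equation (4) as the sum of document scores within one data source. That reading is inferred from the arithmetic of the paper's worked example rather than from the omitted equations. Table II is not reproduced, so the term frequencies below are hypothetical values chosen only to give the same scores as that example.

```python
# Concept weights from the "gastric disease" example.
WEIGHTS = {"gastric disease": 1.0, "gastritis": 0.5, "gastric cancer": 0.5}


def document_relevance(term_freqs, weights):
    """Weighted sum of concept frequencies in one document (reading of eq. 3)."""
    return sum(w * term_freqs.get(concept, 0) for concept, w in weights.items())


def total_relevance(documents, weights):
    """Sum of document relevance scores within one data source (reading of eq. 4)."""
    return sum(document_relevance(tf, weights) for tf in documents.values())


# HYPOTHETICAL term frequencies; which derived concept occurs in which
# document is an assumption, since Table II is not available.
source_1 = {
    "d1,1": {"gastric disease": 1, "gastritis": 1},
    "d1,2": {"gastric cancer": 2},
}
source_2 = {
    "d2,3": {"gastritis": 2},
    "d2,4": {"gastric cancer": 1},
}

for name, docs in (("source 1", source_1), ("source 2", source_2)):
    scores = {doc: document_relevance(tf, WEIGHTS) for doc, tf in docs.items()}
    print(name, scores, "total:", total_relevance(docs, WEIGHTS))
```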
The calculation of the relevance score between the concept set and all distributed data sources
In order to select appropriate data sources this research calculates the relevance score of all data sources by how likely they are to satisfy the concept set needed. This approach applies the modification of the collection retrieval inference (CORI) network database selection algorithm ([4] Callan et al. , 1995; [14] French et al. , 1999) for the data source selection. The CORI network algorithm is an extension of the Bayesian inference networks and it uses a variant of TF*IDF approaches to rank data sources.
The relevance scores of all data sources are affected by two scores, which are the inner score T_j and the outer score I_j. The calculation of the inner score T_j is shown in equation (5): Equation 5 [Figure omitted. See Article Image.] where total relevance_d(j) is the total relevance score of all documents within the search results from each data source, k is the total number of indexing terms within the selected data source and avg k is the average number of indexing terms within all data sources.
The calculation of the outer score I_j is shown in equation (6): Equation 6 [Figure omitted. See Article Image.] where ds_j is the total number of data sources which contain at least one concept in the concept set and ds_J is the total number of data sources in the grid environment. Finally the relevance score of the data source is computed as shown in equation (7): Equation 7 [Figure omitted. See Article Image.] where b is a parameter which is set at 0.5.
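The data source scoring can be sketched in the standard CORI form. Equations (5) to (7) are not reproduced here, so the formulas below are an assumption: they use the usual CORI defaults (constants 50 and 150) with total relevance in place of document frequency. Under that assumption the sketch gives the intermediate values quoted in the paper's worked example (T_1 ≈ 0.011164, T_2 ≈ 0.008330, I ≈ 0.203114) to within rounding. Function and parameter names are hypothetical.

```python
import math


def inner_score(total_rel, k, avg_k):
    """Inner score T_j (assumed CORI form): saturating function of the total
    relevance, normalised by the source's share of indexing terms."""
    return total_rel / (total_rel + 50 + 150 * k / avg_k)


def outer_score(ds_matching, ds_total):
    """Outer score I_j (assumed CORI form): IDF-like term over data sources."""
    return math.log((ds_total + 0.5) / ds_matching) / math.log(ds_total + 1.0)


def source_relevance(t, i, b=0.5):
    """Relevance of a data source: b + (1 - b) * T_j * I_j, with b = 0.5."""
    return b + (1 - b) * t * i


# Values from the worked example: two data sources, both containing concepts,
# with 4 and 3 indexing terms (average 3.5) and total relevance 2.5 and 1.5.
t1 = inner_score(2.5, k=4, avg_k=3.5)
t2 = inner_score(1.5, k=3, avg_k=3.5)
i = outer_score(ds_matching=2, ds_total=2)

print(round(t1, 6), round(t2, 6), round(i, 6))
print(round(source_relevance(t1, i), 6))
```

Because T_j and I_j are both small here, the source scores stay close to b; the ordering of the two sources is driven entirely by their total relevance and index size.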
The calculation of the global relevance score of each document within search results
At this point both the relevance score between the concept set and each document within the selected distributed data sources, and the relevance score between the concept set and each distributed data source, have been obtained. The global relevance score of each document for the query can then be calculated as shown in equation (8): Equation 8 [Figure omitted. See Article Image.] where N is the total number of search results and avg relevance_ds(J) is the average of the relevance scores of all data sources. According to the global relevance score of each document, the search results can be ranked by considering the relevance of the user concept tree and data resources at the same time.
In order to clarify the concept of this proposed approach, the following example is provided. A query term "gastric disease" is used as a base concept. According to the hierarchical structure in Figure 3 [Figure omitted. See Article Image.] its one-level derived concepts are "gastritis" and "gastric cancer". Therefore the total number of concepts is three. Based on equation (1), W_1 (gastric disease) is 1, W_2 (gastritis) is 0.5 and W_3 (gastric cancer) is 0.5. Two data sources are used in this example. Each data source contains two documents. The detailed description of this example is shown in Table II [Figure omitted. See Article Image.]. Based on equation (3), relevance_d(1,1) is 1+0.5×1=1.5, relevance_d(1,2) is 0.5×2=1, relevance_d(2,3) is 0.5×2=1, and relevance_d(2,4) is 0.5×1=0.5. Based on equation (4), total relevance_d(1) is 1.5+1=2.5 and total relevance_d(2) is 1+0.5=1.5. The total numbers of indexing terms in data sources 1 and 2 are four and three, respectively. Thus, the average number of indexing terms is 3.5. Based on equation (5): Equation 9 [Figure omitted. See Article Image.] and: Equation 10 [Figure omitted. See Article Image.] I_1 and I_2 equal: Equation 11 [Figure omitted. See Article Image.] Based on equation (7), relevance_ds(1) equals 0.5+0.5×0.011164×0.203114=0.501134 and relevance_ds(2) equals 0.5+0.5×0.008330×0.203114=0.500850. Based on equation (8): Equation 12 [Figure omitted. See Article Image.] Equation 13 [Figure omitted. See Article Image.] Equation 14 [Figure omitted. See Article Image.] and: Equation 15 [Figure omitted. See Article Image.] Therefore the ranking list for the query term is {d_{1,1}, d_{1,2}, d_{2,3}, d_{2,4}}, in descending order of relevance.
Experimental design
To assess the proposed ontology-based grid information retrieval framework, the relevant information was retrieved by using non ontology-based and ontology-based approaches under centralised and distributed computational environments respectively. As a result there were four different prototype systems developed for comparison. The descriptions of the experimental platforms are as follows.
A non ontology-based and centralised model
A non ontology-based information retrieval model is a traditional information retrieval model that retrieves relevant information based on the keyword search mechanism. A centralised information retrieval model stores all data sources in a server. This non ontology-based and centralised information retrieval model is considered a benchmark. The architecture and description of the model are shown in Figure 4 [Figure omitted. See Article Image.] and Table III [Figure omitted. See Article Image.], respectively. All data sources are stored in Site 2, which contains 17,808 biomedical articles. Site 1 plays the role of a user interface for users to submit queries to Site 2, and Site 2 will return a relevant result list.
An ontology-based and centralised model
An ontology-based and centralised information retrieval model first uses the ontology lookup service to compose the user concept set and then takes this concept set as query expansion to search the data source stored in a centralised server. When a user submits a query using the interface of Site 1, this query will be extended to a concept set as a new query for Site 2. This model uses the same platform as the non ontology-based and centralised model (Table III [Figure omitted. See Article Image.]). The architecture of this ontology-based model is shown in Figure 5 [Figure omitted. See Article Image.]. All data sources are stored in Site 2, which contains 17,808 biomedical articles from the BioMed Central corpus.
A non ontology-based and distributed model
A non ontology-based and distributed model does not use OLS to extend the user's original query to a user concept set. Instead this model uses the traditional keyword search mechanism in a distributed environment. In this model the data sources are stored on different servers. This research implemented a grid service to integrate all distributed servers by using the Globus Toolkit 4 package. There are four sites in the distributed model used in this research. The architecture of this distributed model is shown in Figure 6 [Figure omitted. See Article Image.], and the sites in the model are described in Table IV [Figure omitted. See Article Image.]. Site 1 acts as a mediator and is responsible for interconnecting with other sites. The other three sites store 6,044, 5,842 and 5,922 articles, respectively. The Lucene API was used to implement the indexing function of all sites. Site 1 submits the user's queries to all data sources and all data sources return matched result lists. Then the CORI-based algorithm is used for ranking search results from different data sources globally.
An ontology-based and distributed model
The proposed model is an ontology-based and distributed model, which uses the ontology lookup service to compose the user concept set first and then takes this concept set as query expansion to retrieve relevant information from multiple data sources in a distributed environment. This model uses the same platform as the non ontology-based and distributed model (Table IV [Figure omitted. See Article Image.]). The architecture of this ontology-based model is shown in Figure 7 [Figure omitted. See Article Image.]. The grid service of all sites was built using the Globus Toolkit 4 package. Site 1 is a mediator and is responsible for interconnecting with other sites. Initially Site 1 submits the user's queries to OLS and obtains the concept set. Then Site 1 sends the concept set to all data sources and all data sources return matched result lists. This research used the Lucene API to implement the indexing function of all sites. For ranking the search results from different data sources, this research used the CORI-based algorithm, which ranks them globally.
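The mediator role of Site 1 can be sketched as a fan-out of the concept set to every data source followed by a gather of the matched result lists. This is an in-process illustration only: it does not use the Globus Toolkit or the Lucene API, the three sites are hypothetical in-memory dictionaries with invented documents, and simple substring matching stands in for index lookup. The global re-ranking step is left out.

```python
from concurrent.futures import ThreadPoolExecutor

# HYPOTHETICAL in-memory stand-ins for the three data-source sites.
SITES = {
    "site2": {"doc-a": "gastritis in adults", "doc-b": "lung imaging"},
    "site3": {"doc-c": "gastric cancer screening"},
    "site4": {"doc-d": "dopamine pathways"},
}


def search_site(site_name, concept_set):
    """Return (site, doc_id) pairs whose text mentions any concept."""
    return [
        (site_name, doc_id)
        for doc_id, text in SITES[site_name].items()
        if any(concept in text.lower() for concept in concept_set)
    ]


def mediator_search(concept_set):
    """Send the concept set to every site concurrently and gather the matches."""
    with ThreadPoolExecutor(max_workers=len(SITES)) as pool:
        # pool.map preserves the order of SITES, so results are deterministic.
        result_lists = pool.map(lambda s: search_site(s, concept_set), SITES)
        return [hit for hits in result_lists for hit in hits]


hits = mediator_search(["gastric disease", "gastritis", "gastric cancer"])
print(hits)  # [('site2', 'doc-a'), ('site3', 'doc-c')]
```

Querying the sites concurrently means the search stage is bounded by the slowest site rather than the sum of all sites, which is the property the distributed scenarios rely on.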
Experimental results
In the field of information retrieval there are two kinds of evaluation approach: system-centred and user-centred ([42] Tsai et al. , 2006). The system-centred assessment measures, such as recall and precision, depend on a pre-labelled dataset. These measures are designed and used by researchers but not by users ([36] Spink, 2002). The same search results for a query may satisfy one user but may not satisfy another ([27] Ranwez et al. , 2010). It is impractical to build a pre-labelled dataset for users. Furthermore, manual pre-labelling of a real world dataset is very inefficient. Thus these system-centred measures may not capture the complete picture of user performance ([16] Hersh et al. , 2001).
However an information retrieval model should focus on measuring the satisfaction of a user's information needs ([37] Spink and Wilson, 1999). Only users can define the relevant information for their queries. The user-centred assessment measure, i.e. information satisfaction of users, presents a qualitative evaluation approach, which is considered important and also used for information retrieval models ([3] Belkin et al. , 2001; [16] Hersh et al. , 2001; [36] Spink, 2002; [42] Tsai et al. , 2006). By comparison with information satisfaction, the time measure is a straightforward evaluation criterion. [36] Spink (2002) argued that an information retrieval measure should account for the element of time in information seeking behaviour. Thus the research explores a user-centred approach and the time measure for the evaluation of four experimental platforms.
User judgments
The user-centred evaluation approach for the information retrieval model is necessary from the viewpoint of users (e.g. [36] Spink, 2002; [42] Tsai et al., 2006). In this research the user-centred evaluation approach, i.e. the user judgment, is used to evaluate the precision of document ranking for the ontology-based and non ontology-based distributed information retrieval models. Specifically, the top ten documents of the search results are selected from each of these models. As user judgment is a subjective evaluation criterion, a questionnaire was designed in this research for users to rate relevance on a Likert 5-point scale, where 5 indicates full relevance and 1 indicates non-relevance. The search results from the non ontology-based and ontology-based distributed information retrieval models were randomly sorted for the users to evaluate the relevance between the search results and the query keywords. The paired t-test was used to examine whether the difference between these two models is significant. Since the search targets are biomedical articles, 16 doctors who work for Chung Shan Medical University Hospital in Taiwan took part as the participants for user judgments. The 16 questionnaires, collected in June 2007, were all valid.
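The paired t-test used for this comparison can be sketched with the standard library alone. The ratings below are hypothetical values invented for illustration and are not the study's data. The sketch also assumes that pairs are formed per participant, which the text does not state. The statistic is compared with 2.131, the two-tailed critical value of the t distribution for 15 degrees of freedom at the 0.05 level.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# HYPOTHETICAL mean relevance ratings from 16 participants (Likert 1-5).
ontology_based = [4, 4, 3, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 4, 5, 4]
keyword_based = [3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 3, 2, 2, 3, 3, 2]

t = paired_t_statistic(ontology_based, keyword_based)
T_CRITICAL_DF15 = 2.131  # two-tailed, significance level 0.05, 15 degrees of freedom
print(f"t = {t:.2f}, significant at 0.05: {abs(t) > T_CRITICAL_DF15}")
```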
Since an information retrieval model can perform differently on different queries, four keywords were evaluated:
"amine";
"dopamine";
"antioxidant"; and
"lyase".
Table V [Figure omitted. See Article Image.] shows the average relevance score of all documents from the different models. The average relevance values for the ontology-based model and the non ontology-based model are 3.663 and 2.425, respectively. For three of the four queries the difference was significant at the 0.05 level. The experimental results show that the proposed ontology-based information retrieval model is able to select top ten documents with a higher relevance score than the non ontology-based information retrieval model and assist users in obtaining documents in accordance with their query intentions.
Processing time
All the computers used in this research were dedicated computers. The total processing time is divided into four stages, which are the processing times of concept set generation, network transmission, data source searching and document ranking. The description of these four stages of processing time is shown in Table VI [Figure omitted. See Article Image.].
Therefore the processing time for the non ontology-based and centralised model (T_{N_C}) is shown as equation (9). Equations (10), (11) and (12) are the processing times for the ontology-based and centralised model (T_{O_C}), the non ontology-based and distributed model (T_{N_D}), and the ontology-based and distributed model (T_{O_D}), respectively: Equation 16 [Figure omitted. See Article Image.] Equation 17 [Figure omitted. See Article Image.] Equation 18 [Figure omitted. See Article Image.] Equation 19 [Figure omitted. See Article Image.] According to the experimental results in Table VII [Figure omitted. See Article Image.] the distributed information retrieval models take much less time than the centralised information retrieval models for all queries, and the ontology-based information retrieval models take more time than the non ontology-based information retrieval models. It is interesting to note that the processing times of the two non ontology-based models (i.e. T_{N_C}, T_{N_D}) and the proposed approach (i.e. T_{O_D}) do not differ greatly. Building an information retrieval model in a distributed environment is therefore a suitable strategy for the ontology-based information retrieval model.
Conclusions
The ontology provides semantic concepts and can be used for information retrieval, but it has limitations in resource sharing and mass information processing ([50] Zheng et al., 2011). A grid, by contrast, has benefits for sharing resources and processing massive amounts of information in a distributed environment, but cannot handle conceptual heterogeneity without the integration of semantic information ([50] Zheng et al., 2011). Although some semantic grid models have been proposed ([21] Liu and Xiao, 2009; [49] Zhang and Yin, 2010; [48] Zhang and Ting, 2010; [50] Zheng et al., 2011), they have focused only on construction of the semantic grid, ignoring the significance of evaluation from the viewpoint of users.
This research proposed a concept-based information retrieval mechanism in the grid environment and used the concept set in the ontology to capture the user's query intentions as a means of query expansion for searching relevant documents over many distributed data sources. By defining the key concepts of a specific domain and the relationships between concepts, the ontology allows semantic inference and enriches the semantic expressiveness of both indexing and querying data sources. As the degree of significance of distributed data sources for a specific query may differ, this research re-ranks documents from many distributed data sources based on the collection retrieval inference (CORI) algorithm. According to the experiments, this ontology-based mechanism requires more computing resources than the non ontology-based mechanism when evaluated on the criterion of processing time. Table VII shows that this situation can be alleviated by integrating the grid computing technique in a distributed environment. An information retrieval model should also be evaluated from the viewpoint of the user's information needs ([37] Spink and Wilson, 1999). The experiments showed that the average relevance of the ontology-based model is greater than that of the non ontology-based model. Thus this research has demonstrated that the proposed ontology-based information retrieval mechanism is capable of calculating the global relevance score of all documents in a distributed environment and displaying those documents with higher relevance to users.
For future work, several issues can be considered. First, the ontology is usually domain-dependent. Apart from the ontologies in the biomedical domain, which have been used in daily life, there are not many ontologies in other domains in daily use. Automatic construction of domain ontologies by inductive learning techniques from text mining may be an interesting area for investigation. Second, the concept-based resources can be obtained from the integration of ConceptNet ([20] Liu and Singh, 2004) and WordNet ([25] Miller, 1995). ConceptNet is a large-scale commonsense knowledge base, which supports many practical textual-reasoning tasks over real-world documents. WordNet is an online dictionary and is treated as a general domain ontology. The concept tree built from the integration of ConceptNet and WordNet can be used for documents in daily life. Third, the scale of the experimental work should be extended. More queries and more documents should be used in future research. Finally, more evaluation criteria, such as time-cost analysis and F-measures, should be considered.
1. Aloisio, G., Cafaro, M., Epicoco, I., Fiore, S. and Mirto, M. (2005), "A semantic grid-based data access and integration service for bioinformatics", Proceedings of 2005 IEEE International Symposium on Cluster Computing and the Grid, IEEE Computer Society Press, Los Alamitos, CA, pp. 196-203.
2. Basirat, A.H. and Khan, A.I. (2010), "Evolution of information retrieval in cloud computing by redesigning data management architecture from a scalable associative computing perspective", Neural Information Processing, Models and Applications, Lecture Notes in Computer Science, Vol. 6444, Springer, Berlin, pp. 275-82.
3. Belkin, N.J., Cool, C., Kelly, D., Lin, S.-J., Park, S.Y., Perez-Carballo, J. and Sikora, C. (2001), "Iterative exploration, design and evaluation of support for query reformulation in interactive information retrieval", Information Processing & Management, Vol. 37 No. 3, pp. 403-34.
4. Callan, J.P., Lu, Z. and Croft, W.B. (1995), "Searching distributed collections with inference networks", Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, NY, pp. 21-8.
5. Castells, P., Perdrix, F., Pulido, E., Rico, M., Benjamins, V.R., Contreras, J. and Lorés, J. (2004), "Neptuno: semantic web technologies for a digital newspaper archive", The Semantic Web: Research and Applications, Lecture Notes in Computer Science, Vol. 3053, Springer, Berlin, pp. 445-58.
6. Corby, O., Dieng-Kuntz, R., Gandon, F. and Faron-Zucker, C. (2006), "Searching the semantic web: approximate query processing based on ontologies", IEEE Intelligent Systems, Vol. 21 No. 1, pp. 20-7.
7. Côté, R.G., Jones, P., Apweiler, R. and Hermjakob, H. (2006), "The ontology lookup service, a lightweight cross-platform tool for controlled vocabulary queries", BMC Bioinformatics, Vol. 7 No. 97, available at: www.biomedcentral.com/1471-2105/7/97 (accessed 27 April 2012).
8. de Roure, D., Jennings, N.R. and Shadbolt, N.R. (2001), Research Agenda for the Semantic Grid: A Future e-Science Infrastructure, National e-Science Centre, Edinburgh.
9. de Roure, D., Jennings, N.R. and Shadbolt, N.R. (2005), "The semantic grid: past, present, and future", Proceedings of the IEEE, Vol. 93 No. 3, pp. 669-81.
10. Ferrara, A., Ludovico, L.A., Montanelli, S., Castano, S. and Haus, G. (2006), "A semantic web ontology for context-based classification and retrieval of music resources", ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 2 No. 3, pp. 177-98.
11. Foster, I. and Kesselman, C. (2004), The Grid: Blueprint for a New Computing Infrastructure, 2nd ed., Morgan Kaufmann, San Francisco, CA.
12. Foster, I., Kesselman, C. and Tuecke, S. (2001), "The anatomy of the grid: enabling scalable virtual organizations", International Journal of Supercomputer Applications, Vol. 15 No. 3, pp. 200-22.
13. Foster, I., Zhao, Y., Raicu, I. and Lu, S. (2008), "Cloud computing and grid computing 360-degree compared", Proceedings of Grid Computing Environments Workshop (GCE'08), IEEE Press, Washington, DC, available at: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4738445 (accessed 27 April 2012).
14. French, J.C., Powell, A.L., Callan, J., Viles, C.L., Emmitt, T., Prey, K.J. and Mou, Y. (1999), "Comparing the performance of database selection algorithms", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, NY, pp. 238-45.
15. Haav, H.-M. and Lubi, T.-L. (2001), "A survey of concept-based information retrieval tools on the web", Proceedings of the 5th East-European Conference, ADBIS 2001, BibSonomy Press, Knowledge and Data Engineering Group of the University of Kassel, Kassel, pp. 29-41.
16. Hersh, W., Turpin, A., Price, S., Kraemer, D., Olson, D., Chan, B. and Sacherek, L. (2001), "Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations", Information Processing & Management, Vol. 37 No. 3, pp. 383-402.
17. Johnston, W.E. (2002), "Computational and data grids in large-scale science and engineering", Future Generation Computer Systems, Vol. 18 No. 8, pp. 1085-100.
18. Lim, S.C.J., Liu, Y. and Lee, W.B. (2009), "Faceted search and retrieval based on semantically annotated product family ontology", Proceedings of the WSDM '09 Workshop on Exploiting Semantic Annotations in Information Retrieval, ACM Press, New York, NY, pp. 15-24.
19. Lin, X. (1996), "Graphical table of contents", Proceedings of the ACM Conference on Digital Libraries, ACM Press, New York, NY, pp. 45-53.
20. Liu, H. and Singh, P. (2004), "ConceptNet - a practical commonsense reasoning tool-kit", BT Technology Journal, Vol. 22 No. 4, pp. 211-26.
21. Liu, C. and Xiao, H. (2009), "A study on semantic grid information services oriented regional library consortia", Proceedings of the 2009 2nd International Symposium on Computational Intelligence and Design, IEEE Press, Washington, DC, pp. 505-8.
22. Luhn, H.P. (1957), "A statistical approach to the mechanized encoding and searching of literary information", IBM Journal of Research and Development, Vol. 1 No. 4, pp. 309-17.
23. Martins, D.S., Santana, L.H.Z., Biajiz, M., do Prado, A.F. and de Souza, W.L. (2008), "Context-aware information retrieval on a ubiquitous medical learning environment", Proceedings of the 2008 ACM Symposium on Applied Computing, ACM Press, New York, NY, pp. 2348-9.
24. Mayfield, J. and Finin, T. (2003), "Information retrieval on the semantic web: integrating inference and retrieval", Proceedings of Workshop Semantic Web, the 26th International ACM SIGIR Conference, ACM Press, New York, NY, available at: http://ebiquity.umbc.edu/_file_directory_/papers/110.pdf (accessed 27 April 2012).
25. Miller, G.A. (1995), "WordNet: a lexical database for English", Communications of the ACM, Vol. 38 No. 11, pp. 39-41.
26. Nguyen, C.Q. and Phan, T.T. (2009), "An ontology-based approach for key phrase extraction", Proceedings of the ACL-IJCNLP 2009 Conference, Association for Computational Linguistics Press, Stroudsburg, PA, pp. 181-4.
27. Ranwez, S., Ranwez, V., Sy, M.-F., Montmain, J. and Crampes, M. (2010), "User centered and ontology based information retrieval system for life sciences", Proceedings of the Workshop on Semantic Web Applications and Tools for Life Sciences, BMC Press, London, available at: www.biomedcentral.com/1471-2105/13/S1/S4/ (accessed 27 April 2012).
28. Rinaldi, A.M. (2009), "An ontology-driven approach for semantic information retrieval on the web", ACM Transactions on Internet Technology, Vol. 9 No. 3, available at: http://dl.acm.org/citation.cfm?id=1552293 (accessed 27 April 2012).
29. Salton, G. and Buckley, C. (1988), "Term-weighting approaches in automatic text retrieval", Information Processing & Management, Vol. 24 No. 5, pp. 513-23.
30. Segev, A., Leshno, M. and Zviran, M. (2007), "Context recognition using internet as knowledge base", Journal of Intelligent Information Systems, Vol. 29 No. 3, pp. 305-27.
31. Seo, J. and Croft, W.B. (2008), "Blog site search using resource selection", Proceedings of the 17th ACM Conference on Information and Knowledge Management, ACM Press, New York, NY, pp. 1053-62.
32. Shah, U., Finin, T., Joshi, A., Cost, R.S. and Mayfield, J. (2002), "Information retrieval on the semantic web", Proceedings of the ACM CIKM International Conference on Information and Knowledge Management (CIKM 2002), ACM Press, New York, NY, pp. 461-8.
33. Shi, S., Yang, G. and Wang, D. (2002), "Study on topologies of information grid", Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing, IEEE Press, Washington, DC, pp. 426-9.
34. Sieg, A., Mobasher, B. and Burke, R. (2007), "Web search personalization with ontological user profiles", Proceedings of the 6th ACM Conference on Information and Knowledge Management, CIKM'07, ACM Press, New York, NY, pp. 523-33.
35. Sparck Jones, K. (1972), "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, Vol. 28 No. 1, pp. 11-21.
36. Spink, A. (2002), "A user-centered approach to evaluating human interaction with web search engines: an exploratory study", Information Processing & Management, Vol. 38 No. 3, pp. 401-26.
37. Spink, A. and Wilson, T.D. (1999), "Toward a theoretical framework for information retrieval (IR) evaluation in an application", Proceedings of MIRA 99: Evaluation Frameworks for Multimedia Information Retrieval Applications, BCS Press, London, pp. 75-92.
38. Stanoevska-Slabeva, K., Wozniak, T. and Ristol, S. (2010), Grid and Cloud Computing: A Business Perspective on Technology and Applications, Springer, Berlin.
39. Studer, R., Benjamins, V.R. and Fensel, D. (1998), "Knowledge engineering: principles and methods", Data & Knowledge Engineering, Vol. 25 Nos 1/2, pp. 161-99.
40. Sun, H., Li, S., Li, W., Ming, Z. and Cai, S. (2005), "Semantic-based retrieval of remote sensing images in a grid environment", IEEE Geoscience and Remote Sensing Letters, Vol. 2 No. 4, pp. 440-4.
41. Trnkoczy, J., Turk, Z. and Stankovski, V. (2006), "A grid-based architecture for personalized federation of digital libraries", Library Collections, Acquisitions, & Technical Services, Vol. 30 Nos 3/4, pp. 139-53.
42. Tsai, C.-F., McGarry, K. and Tait, J. (2006), "Qualitative evaluation of automatic assignment of keywords to images", Information Processing & Management, Vol. 42 No. 1, pp. 136-54.
43. van der Vet, P.E. (1998), "Bottom-up construction of ontologies", IEEE Transactions on Knowledge and Data Engineering, Vol. 10 No. 4, pp. 513-26.
44. Vateekul, P. and Rungsawang, A. (2004), "DWORM - a distributed text retrieval prototype on grid environment", Proceedings of the IEEE International Symposium on Communications and Information Technologies, IEEE Computer Society Press, Los Alamitos, CA, pp. 222-7.
45. Vega-Gorgojo, G., Bote-Lorenzo, M.L., Gomez-Sanchez, E., Dimitriadis, Y.A. and Asensio-Perez, J.I. (2005), "Semantic search of learning services in a grid-based collaborative system", Proceedings of Cluster Computing and the Grid, CCGrid, IEEE International Symposium 1, IEEE Computer Society Press, Los Alamitos, CA, pp. 19-26.
46. Wenjie, L., Zhiyong, F., Yong, L. and Zhoujun, X. (2004), "Ontology based intelligent information retrieval system", Proceedings of Electrical and Computer Engineering, Canadian Conference, IEEE Press, Washington, DC, pp. 373-6.
47. Yang, Y. and Chute, C.G. (1994), "An example-based mapping method for text categorization and retrieval", ACM Transactions on Information Systems, Vol. 12 No. 3, pp. 252-77.
48. Zhang, J. and Ting, Y. (2010), "Research of retrieving model for digital library based on semantic grid", Proceedings of the 2010 International Forum on Information Technology and Applications, IEEE Press, Washington, DC, pp. 431-4.
49. Zhang, J. and Yin, Q. (2010), "Research of digital library architecture based on semantic grid", Proceedings of the 2010 2nd International Symposium on Information Engineering and Electronic Commerce (IEEC), IEEE Press, Washington, DC, available at: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5533302 (accessed 27 April 2012).
50. Zheng, Z.-Y., Wang, Z.-F., Li, L. and Zhao, J.-X. (2011), "An integration of heterogeneous resources based on semantic grid", Proceedings of the 2011 Second International Conference on Networking and Distributed Computing, IEEE Press, Washington, DC, pp. 88-92.
About the authors
Dr Chihli Hung is an Associate Professor at the Department of Information Management at Chung Yuan Christian University, Taiwan. His current research interests are in digital libraries, intelligence systems, computational intelligence, natural language processing, and data mining. He obtained an MSc (with Distinction) and a PhD in Computer Science from the University of Sunderland, UK in 1994 and 2004, respectively. Chihli Hung is the corresponding author and can be contacted at: [email protected]
Dr Chih-Fong Tsai is an Associate Professor at the Department of Information Management at National Central University, Taiwan. He obtained a PhD from the School of Computing and Technology at the University of Sunderland, UK in 2005. He has published over 20 refereed journal papers. In 2008, he received the "Highly Commended Award" (Emerald Literati Network 2008 Awards for Excellence) for a paper published in Online Information Review ("A review of image retrieval methods for digital cultural heritage resources"). His current research focuses on multimedia information retrieval and data mining applications.
Dr Shin-Yuan Hung is a Professor at the Department of Information Management at National Chung Cheng University, Taiwan. He obtained a PhD in Management Information Systems from National Sun Yat-sen University, Taiwan.
Chang-Jiang Ku obtained a Master's degree in Management Information Systems from National Chung Cheng University, Taiwan in 2007.
Chihli Hung, Department of Information Management, Chung Yuan Christian University, Chung-Li, Taiwan
Chih-Fong Tsai, Department of Information Management, National Central University, Chung-Li, Taiwan
Shin-Yuan Hung, Department of Information Management, National Chung Cheng University, Minhsiung, Taiwan
Chang-Jiang Ku, Department of Information Management, National Chung Cheng University, Minhsiung, Taiwan
Figure 1: The flat, hierarchical and mesh topologies for information grid
Figure 2: The ontology-based grid information retrieval (OGIR) framework
Figure 3: The generation of a one-level user concept tree
Figure 4: The architecture of a non ontology-based and centralised model
Figure 5: The architecture of an ontology-based and centralised model
Figure 6: The architecture of a distributed model
Figure 7: The architecture of an ontology-based and distributed model
Table I: An example of a document-concept matrix
Table II: A simple example of a document-concept matrix
Table III: The description of a centralised model
Table IV: The description of a distributed model
Table V: The statistical data of relevance scores of all documents from two kinds of model
Table VI: The description of processing time in different stages
Table VII: A comparison of processing time for TN_C, TO_C, TN_D and TO_D
Copyright Emerald Group Publishing Limited 2012
