Purpose
- Most digital libraries (DL) are now available online. They also support the standard Z39.50 protocol, which allows computer-based systems to retrieve the information stored in them effectively. The major difficulty lies in the inconsistency between the database schemas of different DL. The purpose of this paper is to present a system known as Argumentation-based Digital Library Search (ADLSearch), which facilitates information retrieval across multiple DL.
Design/methodology/approach
- The proposed approach is based on argumentation theory for schema matching reconciliation from multiple schema matching algorithms. In addition, a distributed architecture is proposed for the ADLSearch system for information retrieval from multiple DLs.
Findings
- Initial performance results are promising. First, schema matching can improve the retrieval performance on DLs, as compared to the baseline technique. Subsequently, argumentation-based retrieval can yield better matching accuracy and retrieval efficiency than individual schema matching algorithms.
Research limitations/implications
- The work discussed in this paper has been implemented as a prototype supporting scholarly retrieval from about 800 DL around the world. However, owing to the complexity of the argumentation algorithm, new DL cannot be added to the system in real time.
Originality/value
- In this paper, an argumentation-based approach is proposed for reconciling the conflicts between the results of multiple schema matching algorithms in the context of information retrieval from multiple DL. Moreover, the proposed approach can also be applied to similar applications that require automatic mapping between multiple database schemas.
Introduction
Unlike traditional means of storage, digital libraries (DL) (Saracevic and Dalbello, 2001) are a new kind of library that has emerged since the end of the twentieth century. In DL information and documents are stored in digital forms which can be accessed and retrieved over the web. Through the standard protocols such as Z39.50, search engines can also search information from different DL, and crawlers can connect directly to the database servers and access the data of the DL. Nowadays DL have become one of the major sources for researchers when finding scholarly information over the web.
Traditionally, DL organise information according to a database schema. Support for information retrieval from multiple DL commonly assumes that the databases of the different DL share the same schema. In practice, however, each digital library has its own schema. As shown in Figure 1, the same publication record may be represented differently when stored in different DL.
In Figure 2 we present a closer view of the problem of inconsistent concept representation in different schemas. When representing the concept of "academic paper", one schema may adopt the term Document while other schemas may use the term Publication. Some others may even split the concept into two sub-concepts such as Article and Publisher. It may be easy for humans to understand the similarity between these terms. However, the inconsistency of terms or keywords used to represent the same concept poses a serious problem for information retrieval from different sources of DL. This leads to a well-known research problem called schema matching.
Different algorithms have been proposed for automatic matching between schemas. However, as most algorithms rely mainly on heuristics to deal with the inconsistency of keywords, applying them to different data sets leads to different, or even conflicting, results (Nguyen et al., 2012). In general each algorithm works well in certain domains, but its performance suffers when applied to other domains. For the digital library domain, the difficulty is that the scholarly materials stored in DL come from different domains, ranging from the social sciences to the natural sciences. Hence selecting a suitable one-size-fits-all matching algorithm is a very challenging task.
In this paper we propose to apply argumentation theory to tackle this problem. The idea here is that, instead of fixing a certain schema matching algorithm, we can try multiple matching strategies at the same time. Then if any conflict is found among the matching results, argumentation theory is applied to infer the most logical and appropriate answer.
This paper makes two main contributions. First, we propose an argumentation-based approach to perform schema matching across multiple DL. The argumentation framework has been published in our previous work (Nguyen et al., 2013); however, this is the first time it has been applied to the digital library domain. Moreover, we also improve our argumentation framework to make it fully automatic, instead of relying on the involvement of human experts. Second, the proposed approach is incorporated into a search system for DL, called Argumentation-based Digital Library Search (ADLSearch). To the best of our knowledge, matching between multiple DL has so far mainly involved manual methods. In contrast, the ADLSearch system is capable of handling more than 800 DL automatically, owing to the integration of our extended argumentation framework.
Related work
Classical schema matching algorithms
Schema matching has been recognised as one of the most important operations required by the process of data integration, which has been studied by the database and AI communities for over 25 years (Doan and Halevy, 2005). There are many cutting-edge schema matching techniques and tools (Bernstein et al., 2011), such as element-level matching, structure-level matching, instance-based matching and combined techniques. Classical and recent tools developed along this direction are discussed in detail by Nguyen et al. (2012), notably including Bmatch (Duchateau et al., 2007), COMA++ (Aumueller et al., 2005), ASMOV (Jean-Mary et al., 2009), Falcon-AO (Gonzalez et al., 2010), AgreementMaker (Marie and Gal, 2007), OII Harmony (Melnik et al., 2002), Auto Mapping Core (AMC) (Peukert et al., 2011) and Ontobuilder (Roitman and Gal, 2006). Most systems focus on semi-structured schema types (e.g. XML, OWL and RDF), in order to be aligned with current business standards (Kabak and Dogac, 2010).
These tools thus introduced various approaches to capture similarities between schemas, including linguistic processing (dictionary lookup, string matching, etc.), structure-based analysis or tuning selection methods. However, the outputs of these methods are still inherently uncertain, as a lot of irrelevant items and mismatches were found when applying these methods to real-life data sets.
Schema matching of big data on the web
As the amount of data shared over the world wide web keeps growing dramatically, schema matching for structured data on the web, especially ontological data used by semantic technologies, is equally attracting considerable attention. Schema matching is considered one of the four challenges of Big Data processing, known as Orri's Challenge (Bizer et al., 2012).
To tackle this problem, increasing the performance of schema matching by using linked data such as Wikipedia has been considered (Assaf et al., 2012). However, this method would suffer from performance issues when dealing with real data where the linkage between elements/entities is very large. Crowdsourcing (Doan et al., 2011), where the major ideas of communities are taken into account and analysed to eventually infer the most logical ones, is a noteworthy approach. However, building a reliable community is another real challenge.
Applying classic schema matching algorithms to big data, especially in the context of the semantic web, was recently discussed (Pinkel et al., 2013). However, the same problem persists when different algorithms are applied. The most recent work (Dong and Srivastava, 2014) suggested a model for data integration in big data, which is a twofold process: constructing a mediated global schema, and generating the mappings between the mediated (global) schema and the local schemas. This approach is also our proposal for schema matching for DL, where argumentation is adopted for the second step of mapping generation.
Schema matching for multiple DL
Different DL have been proposed and developed. For example JeromeDL (Kruk, 2010) is an open source semantic digital library. CDS Invenio (http://invenio-software.org) is another open source digital library with approximately one million documents in 700 collections of different categories. Papadakis et al. (2009) proposed a subject-based digital library, whereas Cinque et al. (2004), Bloehdorn et al. (2007) and Quan et al. (2007) proposed ontology-based DL.
Supporting information search from multiple DL is an emerging research area. The ICDL project (Hutchinson et al., 2005) aimed to organise the indexes and search information from several DL located in different countries. ANTAEUS (Joint, 2010) introduced an amalgamated search engine which searches information sources gathered from multiple DL. Chen et al. (2011) developed CollabSeer to search information on researchers' publications stored in DL for recommending suitable candidates for research projects.
However, scholarly retrieval from multiple DL cannot avoid the issue of schema matching. Schema matching in DL can be considered a specific case of big data schema matching where the stored data is structured scholarly information. In addition, standardised protocols for DL such as Z39.50 and MARC-21 can support information retrieval from multiple DL. A web data integration approach, in which schema matching plays a crucial role, was proposed by Belhajjame et al. (2011) and Bernstein et al. (2011). However, due to the unresolved problem of inconsistency between schema matching algorithms, most of the methods for data integration from multiple DL are still manual in practice (Song et al., 2005; Kent and Bowman, 2011).
Unlike classic schema matching algorithms, COSM (Song et al., 2005) is a clustering-based approach which aims to infer matchings from element-based clustering results on DL data. However, applying clustering to large-scale data still requires data pre-processing steps. Content-based systems, such as SIMPLIcity (Chen and Wang, 2002) or ETANA (Ravindranathan et al., 2004) for multimedia retrieval from DL, also take a noteworthy approach, as they try to extract semantic information from the contents of the materials stored in the DL, rather than processing at the schema layer. However, attempts to automate this process using machine learning algorithms still encounter considerable difficulty due to the complexity of dealing with large volumes of data (Shvaiko and Euzenat, 2013). As a result, information retrieval from multiple DL with various data schemas still takes a manual approach, such as the Nebula interface for constructing conceptual knowledge systems for DL (Kent and Bowman, 2011).
Applications of argumentation-based approaches
The argumentation-based approach, in which matching decisions are formulated as arguments, is a kind of propositional logic supporting reasoning and reconciliation in n-party games (Phan, 1995). This work then evolved into argumentation theory, which is a systematic study of techniques to reach conclusions from given premises (Besnard and Hunter, 2008). Once decisions are expressed as arguments, we can detect the conflicts between them and support the selection of the most reasonable arguments to resolve the conflicts.
There are two kinds of argumentation approach: abstract argumentation and logical argumentation (Prakken, 2012). The former was proposed by Dung (1995), who described arguments as abstract objects. Dung (1995, 2007) also introduced the concept of acceptability semantics, which defined different levels of acceptance for a proposed argument.
However, the most prominent proposal in this area is logical argumentation (Besnard and Hunter, 2008) which was adopted in this research. This approach relies on propositional logic to describe the arguments. The theoretical details and running example of applying logical argumentation for schema matching will be presented in the next section.
The argumentation-based approach has been successfully applied to many practical applications. Bentahar et al. (2010) used argumentation for solving conflicts that may arise among web services and resources in business processes of e-commerce systems. In collaborative and cooperative planning (Sapena et al., 2011) argumentation can be combined with machine learning to improve the automation level of operations policies. In social networks (Grosse et al., 2012) natural language processing is adopted to extract arguments from textual data, which are used to make social agreements among participants. In cloud computing (Heras et al., 2012) argumentation can be used to help cloud providers handle physical failures in a collaborative manner. In the semantic web (Rahwan et al., 2007) argumentation has been modelled using the Argument Interchange Format ontology, allowing large-scale collection of interconnected arguments on the web.
Motivation for this research from existing work
Schema matching is a technique which aims at reasonably matching elements from different schemas. Thus this technique plays a crucial role in data integration from various sources, especially from those available on the internet. Many classic schema matching algorithms have been proposed, each of which achieved better accuracy when applied to certain domains of data. However, to identify which algorithm is the best for a given data set is an important task which still remains unsolved.
With the recent emerging trends of big data and semantic technologies, schema matching is one of the four major challenges of performing data integration from multiple databases/ontologies. One of the works in this field suggested the usage of a central schema which the element matchings will be centred around. We adopt this idea to integrate scholarly data from multiple DL.
So far data integration of multiple DL has still relied heavily on manual methods. We propose to use the argumentation technique to automate this process as this method can yield reasonable combinations from matching results. However, the existing approach of argumentation still requires human intervention from experts to approve or disapprove each matching produced. We overcome this obstacle by using empirical thresholds to replace human decisions, which is discussed in the subsequent sections.
Argumentation-based conflict reconciliation of schema matching results
In this section we describe the argumentation-based approach to conflict reconciliation of schema matching results. Several schema matching algorithms are currently available. However, thus far no algorithm has been shown to be better than the others. Moreover, conflicts can arise between the matching results produced by these algorithms. In previous work we proposed an argumentation-based framework to handle this problem (Nguyen et al., 2013). In this work the framework is adopted and extended to support automatic schema matching for DL.
As shown in Figure 3 the framework consists of two phases: individual validation and conflict reconciliation.
Individual validation involves two steps. The first step is individual matching, which involves several matching algorithms. The mappings output by the matching algorithms are integrated in the schema mapping table. The second step is argument construction, which converts the stored mappings into a mathematical representation - the argument - for further processing. The arguments are stored in the arguments set.
The conflict reconciliation phase reconciles the mapping conflicts. It comprises the following tasks:
Conflict detection: as the mappings are converted into arguments in the first phase, we process the arguments to detect any conflicts among them mathematically.
Argument evaluation: when a conflict between arguments is detected, the involved arguments will be evaluated to determine their strengths.
Guided resolution: based on the strengths of the arguments, a final resolution is inferred by aggregating the argument strengths. The resolution indicates which mapping should be retained or removed to resolve the conflict.
Individual validation
In this phase a number of matching algorithms will be employed to generate a mapping between database schemas. A common characteristic of most matching algorithms is that a mapping is evaluated by a score, which can be easily normalised uniformly across all of the algorithms. If an algorithm A evaluates a mapping m by a score S greater than an upper threshold T_u, we say that A approves m. Otherwise, if S is less than a lower threshold T_l, we say that A disapproves m. T_u and T_l can be determined empirically.
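The thresholding rule can be sketched in a few lines of Python. This is a minimal illustration rather than the ADLSearch implementation: the function name `decide` and the threshold values 0.7 and 0.3 are hypothetical, and treating a score between the two thresholds as "no decision" is an assumption, since the text does not say how such scores are handled.

```python
from typing import Optional


def decide(score: float, upper: float, lower: float) -> Optional[bool]:
    """Turn a normalised matcher score into an approve/disapprove decision.

    Returns True (approve), False (disapprove) or None (assumed: no decision
    when the score lies between the two thresholds).
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if score > upper:
        return True
    if score < lower:
        return False
    return None


# Hypothetical thresholds; in the paper T_u and T_l are tuned empirically.
T_U, T_L = 0.7, 0.3

print(decide(0.85, T_U, T_L))  # True  (approve)
print(decide(0.10, T_U, T_L))  # False (disapprove)
print(decide(0.50, T_U, T_L))  # None  (no decision, by assumption)
```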
Figure 4 illustrates an example in which some mappings between three schemas S1, S2 and S3 are generated by three algorithms called Algorithm 1, Algorithm 2 and Algorithm 3. Table I presents the corresponding schema mapping table capturing information on these mappings.
It can easily be observed that some prominent conflicts occur in the mappings. For example, the attribute S1.ReleaseDate is matched with two distinct attributes of schema S3, namely S3.ProductionDate and S3.AvailabilityDate (mappings c4 and c2). There is another, more complex conflict: S3.AvailabilityDate is matched to S2.ScreeningDate (mapping c3), which is then matched to S1.ReleaseDate (mapping c1), which in turn is matched to S3.ProductionDate (mapping c4). Using such mappings for data integration can lead to unwanted effects: S3.AvailabilityDate and S3.ProductionDate would be linked even though they are two distinct attributes in the same schema.
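Both kinds of conflict described above amount to two distinct attributes of one schema ending up in the same group of linked attributes. The sketch below illustrates this with a small union-find structure. It is not the detection method used in the paper, which works on logical arguments; the function name `same_schema_clashes` is hypothetical, and the endpoints of c1 to c4 are taken from the running example.

```python
from collections import defaultdict

# Mappings from the running example: id -> (attribute, attribute).
MAPPINGS = {
    "c1": ("S1.ReleaseDate", "S2.ScreeningDate"),
    "c2": ("S1.ReleaseDate", "S3.AvailabilityDate"),
    "c3": ("S3.AvailabilityDate", "S2.ScreeningDate"),
    "c4": ("S1.ReleaseDate", "S3.ProductionDate"),
}


def same_schema_clashes(mapping_ids):
    """Return groups of distinct same-schema attributes linked by the mappings."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two endpoints of every accepted mapping.
    for mapping_id in mapping_ids:
        left, right = MAPPINGS[mapping_id]
        parent[find(left)] = find(right)

    # Collect the attributes of each connected group.
    groups = defaultdict(set)
    for attribute in list(parent):
        groups[find(attribute)].add(attribute)

    # A clash is a group holding two or more attributes of the same schema.
    clashes = []
    for attributes in groups.values():
        by_schema = defaultdict(list)
        for attribute in sorted(attributes):
            by_schema[attribute.split(".")[0]].append(attribute)
        clashes += [tuple(v) for v in by_schema.values() if len(v) > 1]
    return sorted(clashes)


print(same_schema_clashes(["c2", "c4"]))        # one-to-two conflict
print(same_schema_clashes(["c1", "c3", "c4"]))  # conflict through a chain
print(same_schema_clashes(["c1", "c2", "c3"]))  # [] (consistent subset)
```

The first two calls both report the pair (S3.AvailabilityDate, S3.ProductionDate), matching the two conflicts discussed in the text.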
Argument construction
To detect and handle these conflicts, we first generate arguments from the schema mapping table. In general an argument can be represented in the form {<support>, claim}, meaning that the argument makes a claim which is supported by the facts in its support. In argumentation theory both the support and the claim are logical formulae. For example, from the fact that Algorithm 1 approves mapping c1, we can generate the argument a1 = {<c1>, c1}, where the claim is that c1 is correct (i.e. S1.ReleaseDate is the same as S2.ScreeningDate). This claim is based on the simple support that the algorithm has already approved c1. A more complex example is the argument a4 = {<c2, ¬c2 ∨ ¬c4>, ¬c4}, which can be interpreted as follows. The claim of this argument is that c4 is not correct. This claim is supported by two facts: c2 is approved, and c2 and c4 cannot both be correct at the same time (¬c2 ∨ ¬c4).
Table II depicts the process of generating arguments from the schema mapping table in both logic and verbal descriptions. For the details on how to generate arguments, especially by mathematical deduction, see Besnard and Hunter (2008).
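As an illustration of the {<support>, claim} form, the following sketch encodes the arguments a1 and a4 from the text and checks by truth table that each support entails its claim. The encoding (literals as pairs, clauses as frozensets) and the names `Argument` and `support_entails_claim` are assumptions made for this example. Only entailment is checked here; Besnard and Hunter (2008) additionally require the support to be consistent and minimal.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, Tuple

Literal = Tuple[str, bool]     # (mapping id, True = mapping holds, False = negated)
Clause = FrozenSet[Literal]    # a disjunction of literals


@dataclass(frozen=True)
class Argument:
    name: str
    support: Tuple[Clause, ...]  # conjunction of clauses
    claim: Literal


def clause_holds(clause, assignment):
    return any(assignment[var] == polarity for var, polarity in clause)


def support_entails_claim(arg: Argument) -> bool:
    """Brute-force check: every model of the support satisfies the claim."""
    variables = sorted({v for clause in arg.support for v, _ in clause} | {arg.claim[0]})
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(clause_holds(c, assignment) for c in arg.support):
            if assignment[arg.claim[0]] != arg.claim[1]:
                return False
    return True


# a1 = {<c1>, c1}
a1 = Argument("a1", (frozenset({("c1", True)}),), ("c1", True))
# a4 = {<c2, not c2 or not c4>, not c4}
a4 = Argument(
    "a4",
    (frozenset({("c2", True)}), frozenset({("c2", False), ("c4", False)})),
    ("c4", False),
)
# A deliberately ill-formed argument: c2 alone does not entail "not c4".
bad = Argument("bad", (frozenset({("c2", True)}),), ("c4", False))

print(support_entails_claim(a1))   # True
print(support_entails_claim(a4))   # True
print(support_entails_claim(bad))  # False
```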
Conflict detection
The representation of arguments can be used to interpret more precisely the direct and indirect conflicts between them. If the claims of two arguments w1 and w2 contradict each other, then we say that the arguments w1 and w2 are in direct conflict. If the claim of an argument w2 appears in a negated form in the support of w1, then it is referred to as an indirect conflict. For example, arguments a4 and b2 in Table II pose a direct conflict since their claims contradict each other.
Since arguments are represented as logic formulae, mathematical proofs (Chang and Lee, 1973) can be used to detect conflicts between them in an automatic manner.
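The two definitions can be illustrated with a purely syntactic check on literals. This is a simplified sketch: the paper detects conflicts with logical proofs (an ASP solver in the implementation), whereas the code below only compares literals. Table II is not reproduced in this text, so the forms a2 = {<c2>, c2} and b2 = {<c4>, c4} are assumed by analogy with a1.

```python
from typing import FrozenSet, NamedTuple, Tuple

Literal = Tuple[str, bool]  # (mapping id, True = holds, False = negated)


class Argument(NamedTuple):
    name: str
    support: Tuple[FrozenSet[Literal], ...]  # each frozenset is a disjunction
    claim: Literal


def negate(literal: Literal) -> Literal:
    return (literal[0], not literal[1])


def direct_conflict(w1: Argument, w2: Argument) -> bool:
    """The claims of w1 and w2 contradict each other."""
    return w1.claim == negate(w2.claim)


def indirect_conflict(w1: Argument, w2: Argument) -> bool:
    """The claim of w2 appears in negated form in the support of w1."""
    target = negate(w2.claim)
    return any(target in clause for clause in w1.support)


a2 = Argument("a2", (frozenset({("c2", True)}),), ("c2", True))  # assumed form
b2 = Argument("b2", (frozenset({("c4", True)}),), ("c4", True))  # assumed form
a4 = Argument(
    "a4",
    (frozenset({("c2", True)}), frozenset({("c2", False), ("c4", False)})),
    ("c4", False),
)

print(direct_conflict(a4, b2))    # True: "not c4" against "c4"
print(direct_conflict(a4, a2))    # False
print(indirect_conflict(a4, b2))  # True: "not c4" occurs in the support of a4
print(indirect_conflict(a2, b2))  # False
```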
Argument evaluation
When a conflict is detected between two arguments, the conflict can be resolved by removing an unreasonable argument and keeping the reasonable one. To justify whether an argument is reasonable, it is natural to evaluate the argument as a numerical value, known as the strength of the argument.
In this research each argument is evaluated by a score in the range [0, 1], which is called the acceptance ratio of that argument (Phan, 1995). Egly et al. (2010) provide methods to compute acceptance ratios whose complexities are theoretically high. Here we develop a method, known as a defence graph, which relies on the defence analysis between arguments. An argument w1 is said to be defended by w2 if the claim of w2 appears in the support of w1. In other words, the claim of w2 makes the claim of w1 more reliable. For example, in Table II a4 is defended by a2 since the claim of a2 (which is c2) appears in the support of a4. Figure 5 presents the defence graph of the arguments given in Table II.
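A defence graph can be built directly from this definition. The sketch below stores formulae as plain strings and treats "appears in the support" as exact membership of the claim among the support formulae, which is an assumption about how the definition is applied. Only a4 comes from the text verbatim; the forms of a2 and b2 are assumed, and the full graph of Figure 5 is not reproduced.

```python
# Each argument is given as {"support": [...formulae...], "claim": formula}.
ARGUMENTS = {
    "a2": {"support": ["c2"], "claim": "c2"},                # assumed form
    "a4": {"support": ["c2", "¬c2 ∨ ¬c4"], "claim": "¬c4"},  # from the text
    "b2": {"support": ["c4"], "claim": "c4"},                # assumed form
}


def defence_graph(arguments):
    """Map each argument name to the names of the other arguments defending it."""
    graph = {name: [] for name in arguments}
    for name, argument in arguments.items():
        for other_name, other in arguments.items():
            # Self-defence is excluded here; the strength formula accounts
            # for it separately by adding 1 to the numerator.
            if other_name != name and other["claim"] in argument["support"]:
                graph[name].append(other_name)
    return graph


print(defence_graph(ARGUMENTS))
# {'a2': [], 'a4': ['a2'], 'b2': []}
```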
Based on a defence graph the strength of an argument w can be evaluated as:
$$\mathrm{strength}(w) = \frac{n_d + 1}{N} \qquad (1)$$
where n_d is the number of arguments that defend w and N is the total number of arguments. The numerator is increased by 1 because, by default, an argument is always defended by itself.
For example, we have strength(a5) = 4/10 = 0.4, strength(a6) = 0.3, strength(a4) = 0.2, strength(b2) = 0.2 and strength(k2) = 0.2. The justification behind these values is that the more reasonable an argument is, the more arguments defend it, giving it a higher strength.
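The strength values quoted above can be reproduced from the description of the strength measure: the number of defenders plus one, divided by the total number of arguments. In this sketch N = 10 follows from strength(a5) = 4/10, and the defender counts are inferred by inverting the formula on the stated strengths, not read from Figure 5, which is not reproduced here.

```python
from fractions import Fraction


def strength(n_defenders: int, total_arguments: int) -> Fraction:
    """(n_d + 1) / N, where the +1 counts the argument defending itself."""
    if total_arguments <= 0:
        raise ValueError("total_arguments must be positive")
    return Fraction(n_defenders + 1, total_arguments)


N = 10  # total number of arguments, implied by strength(a5) = 4/10
# Defender counts inferred from the strengths stated in the text.
DEFENDER_COUNTS = {"a5": 3, "a6": 2, "a4": 1, "b2": 1, "k2": 1}

strengths = {name: float(strength(n, N)) for name, n in DEFENDER_COUNTS.items()}
print(strengths)
# {'a5': 0.4, 'a6': 0.3, 'a4': 0.2, 'b2': 0.2, 'k2': 0.2}
```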
Guided resolution
After being evaluated, arguments supporting/opposing the same mappings are aggregated and form pairs of conflicting mapping decisions. From the evaluation values of arguments, we apply aggregate operators to compute the score of the mappings.
Figure 6 illustrates the mapping evaluation. In the example given in Table II we have three arguments a4, a5 and a6 claiming that the mapping c4 should be disapproved, with respective scores of 0.2, 0.4 and 0.3.
Figure 7 depicts the conflict resolution process between approving (c4) and disapproving (¬c4). The disapproval decision (¬c4) is derived from arguments a4, a5 and a6, while the approval comes from arguments b2 and k2. Assuming that the SUM operator is applied, the scores of c4 and ¬c4 are 0.4 and 0.9, respectively. These values clearly indicate that we should follow the disapproval decision (¬c4) and discard the approval.
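The resolution of c4 against ¬c4 can be reproduced as follows. The function name `resolve` is hypothetical, the aggregate operator is a parameter so that operators other than SUM can be tried, and returning "undecided" on a tie is an assumption, since the text does not discuss ties.

```python
import math


def resolve(approving, disapproving, aggregate=sum):
    """Compare aggregated strengths of approving and disapproving arguments."""
    approve_score = aggregate(approving)
    disapprove_score = aggregate(disapproving)
    if math.isclose(approve_score, disapprove_score):
        decision = "undecided"  # assumption: ties are left unresolved
    elif approve_score > disapprove_score:
        decision = "approve"
    else:
        decision = "disapprove"
    return decision, approve_score, disapprove_score


# Strengths from the running example: b2, k2 approve c4; a4, a5, a6 disapprove it.
decision, s_approve, s_disapprove = resolve([0.2, 0.2], [0.2, 0.4, 0.3])
print(decision, round(s_approve, 2), round(s_disapprove, 2))
# disapprove 0.4 0.9
```

Passing `aggregate=max` instead would compare 0.2 with 0.4 and reach the same decision for this example.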
Extension of the reconciliation framework in this work
Compared to our previous work (Nguyen et al., 2013), the framework which has been discussed is extended in the following ways:
In the previous work we relied on human experts to approve or disapprove a mapping. In this work this step is automated by using upper and lower thresholds on the matching scores. Thus the reconciliation framework scales to the large number of schemas used by the various DL.
We suggest using a defence graph to calculate argument strength. Thus the complexity of this step is reduced significantly, as compared to the logic-based approach introduced in the previous work.
The ADLSearch system
Figure 8 shows our proposed ADLSearch system, which is a search engine designed for searching scholarly information from multiple DL over the internet. One can observe that the architecture of ADLSearch comprises the major components of a typical search engine including crawling, retrieving and an indexed data layer. In particular the system is enhanced by the argumentation-based conflict reconciliation framework, which has just been discussed. This component is incorporated for handling conflicts when mapping schemas between multiple DL.
The main function of the crawling component is to crawl scholarly information from multiple DL on the internet. DL usually offer a web-based graphical user interface (GUI) allowing general users to search for information in a convenient manner. Apart from that, information from DL can also be automatically retrieved through the Z39.50 protocol. As such, the crawling component can retrieve information from a specific digital library. There are two types of information to be crawled: the schema of the scholarly information organised in the digital library and the document descriptors that describe the significant attributes of the documents such as authors, titles, publication information, etc. However, accessing the full text of the documents may require membership.
To store and index the schema and document descriptors crawled from the DL, ADLSearch maintains a central schema and a central database in the indexed data layer. The central schema defines a "standardised" schema adopted by the system. When ADLSearch crawls information from a new digital library, the schema of the new digital library is extracted and mapped onto the central schema. Based on the attributes defined in the central schema, the crawled document descriptors are indexed and stored in the central database. Figure 9 illustrates a document record stored in our central schema.
In addition as ADLSearch collects information from multiple DL over the internet, a schema mapping table is also constructed to store all of the mappings between the schemas of the crawled DL and the central schema. When ADLSearch connects to a new digital library, the mappings between the central schema of ADLSearch and the schema of the new digital library will be generated and added to the schema mapping table. As discussed before, the proposed argumentation-based conflict reconciliation framework will be responsible for generating the contents of the schema mapping table and handling the conflicts.
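The role of the schema mapping table can be pictured with a small sketch that rewrites a crawled descriptor into central-schema attributes. Everything here is hypothetical: the library names, the attribute names and the function `to_central` are invented for illustration, and the actual central schema (Figure 9) is not reproduced.

```python
# Hypothetical mapping table: library -> {source attribute: central attribute}.
MAPPING_TABLE = {
    "LibraryA": {"Document.Name": "title", "Document.Creator": "authors"},
    "LibraryB": {"Publication.Title": "title", "Publication.Writer": "authors"},
}


def to_central(library, record):
    """Split a crawled record into central-schema fields and unmapped fields."""
    mapping = MAPPING_TABLE[library]
    central, unmapped = {}, {}
    for attribute, value in record.items():
        if attribute in mapping:
            central[mapping[attribute]] = value
        else:
            unmapped[attribute] = value  # kept aside rather than silently dropped
    return central, unmapped


record = {
    "Publication.Title": "Schema matching",
    "Publication.Writer": ["A. Author"],
    "Publication.ShelfMark": "QA76",
}
central, unmapped = to_central("LibraryB", record)
print(central)   # {'title': 'Schema matching', 'authors': ['A. Author']}
print(unmapped)  # {'Publication.ShelfMark': 'QA76'}
```

Once descriptors from different DL share the central attribute names, a single query over `title` or `authors` can cover all of them.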
Based on the indexed central database, the retrieving component will perform descriptor retrieval to retrieve documents whose descriptors matched the queries submitted by users via query processing. If the full texts of the retrieved documents are available either by the policy of the hosting DL or memberships of the users, then full text retrieval will retrieve the full text of the corresponding document. Finally, result producing displays the final retrieval results to the users.
System interface
In ADLSearch we have downloaded schemas from the DL listed at www.loc.gov/z3950/gateway.html. This page lists approximately 800 libraries that support the Z39.50 protocol, which enables automatic information access and retrieval for these libraries. The downloaded schemas are mapped and indexed in ADLSearch as discussed before. Like other search engines for DL, ADLSearch lets users search for relevant scholarly information in the indexed DL. Currently the following search functions are supported in ADLSearch:
document search searches for documents related to the submitted keywords;
author search searches for publications of specified authors;
publisher search searches for documents published by specified publishers; and
expert search searches for experts in areas specified by keywords.
Figure 10 shows the document search interface of the system, where users can choose which of the DL indexed by ADLSearch to search. Users can view the detailed information of a retrieved document as illustrated in Figure 11. If the user has the necessary permission, they can continue to retrieve the full text of the document from the digital library that hosts it. A distinctive feature of ADLSearch is that users can add and index new digital library schemas automatically. Users can keep track of the mapping decisions, the generated arguments and their evaluated strengths as shown in Figure 12. Moreover, expert users can view details of the technical implementation, such as the evaluation process and its defence graph, as shown in Figure 13.
Implementation
Regarding the technical implementation, the system was developed using the Java programming language. As mentioned earlier, the ADLSearch system currently employs three matching algorithms: COMA++, AMC and Ontobuilder. We have also used the Vispatrix (Charwat et al., 2012) tool to support the generation of arguments from the outputs of the matching algorithms. In addition, the ASP solver in DLV-complex (Calimeri et al., 2008) was adopted to detect conflicts between arguments.
Experiment results
Research questions
To evaluate the performance of the proposed approach we conducted experiments to verify two hypotheses as follows:
H1. The argumentation approach can improve the schema matching accuracy compared to individual matching algorithms.
This claim was supported in our previous work (Nguyen et al., 2013), but we wanted to verify it again on scholarly data sets collected from DL.
H2. Employing schema matching can improve the retrieval efficiency from various DL.
In addition, benefiting from better schema matching accuracy, the argumentation-based approach should achieve better retrieval precision. According to the two hypotheses, we evaluated the performance of our system using two measures: schema matching accuracy and retrieval efficiency, respectively. Appropriate metrics were adopted in the evaluation of these two measures.
Data sets and matching algorithms
For this experiment we used a data set comprising schemas of DL collected from the webpage www.loc.gov/z3950/gateway.html. We classified similar schemas into schema patterns, giving 71 patterns in total, which can be downloaded from www.cse.hcmut.edu.vn/~save/patterns.zip. We considered only matching tools whose sources were available without any licensing issues, and used the three most popular schema matchers: COMA++ (Aumueller et al., 2005), AMC (Peukert et al., 2011) and OntoBuilder (Roitman and Gal, 2006), as given in Table III. These three matching tools are also deployed in ADLSearch.
Schema matching accuracy
We evaluated the schema matching accuracy of our argumentation-based approach against the individual matching algorithms. To carry out the experiments we selected a data set comprising 20 schema patterns, which covered about 2,000 corresponding records. We then manually produced the matchings between these patterns and treated them as the ground truth of the experiment. Next we ran the schema matching algorithms on the data set of 20 schema patterns. If the output of a schema matching algorithm agreed with the ground truth, we counted it as a hit, and otherwise as a miss.
Then we defined the ratio of accuracy metric, which measures the number of hits over the total number of suggestions provided by the corresponding resolution strategy. It was calculated based on the following formula:
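$$\mathrm{accuracy}(algorithm, action\_kind) = \frac{\text{number of hits}(algorithm, action\_kind)}{\text{number of suggestions}(algorithm, action\_kind)} \qquad (2)$$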
where algorithm is the matching algorithm involved and action_kind is the kind of matching action suggested by the algorithm. The action_kind can be approving or disapproving of a mapping of schema attributes. For example if the algorithm is COMA++ and the action_kind is approving, it means that we aim at evaluating the accuracy performance of the COMA++ algorithm when suggesting an approving action. As discussed earlier, upper and lower thresholds were used to identify whether a matching algorithm approves or disapproves a mapping. We tuned the values of upper and lower thresholds for each algorithm to identify the most appropriate thresholds used for approving and disapproving actions.
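The ratio of accuracy described above (hits over total suggestions, per action kind) can be computed as in the following sketch. The mapping identifiers, the toy suggestions and the function name `accuracy_ratio` are hypothetical. An approving suggestion counts as a hit when the ground truth says the mapping is correct, and a disapproving one counts as a hit when the ground truth says it is incorrect.

```python
from typing import Dict, Optional


def accuracy_ratio(
    suggestions: Dict[str, str],
    ground_truth: Dict[str, bool],
    action_kind: str = "overall",
) -> Optional[float]:
    """Hits divided by suggestions for one action kind (or for all of them)."""
    hits = total = 0
    for mapping_id, action in suggestions.items():
        if action_kind != "overall" and action != action_kind:
            continue
        total += 1
        # A suggestion is a hit when it agrees with the ground truth.
        if (action == "approving") == ground_truth[mapping_id]:
            hits += 1
    return hits / total if total else None  # None when nothing was suggested


# Toy data for one matcher (hypothetical).
suggestions = {"m1": "approving", "m2": "approving",
               "m3": "disapproving", "m4": "approving"}
ground_truth = {"m1": True, "m2": False, "m3": False, "m4": True}

print(accuracy_ratio(suggestions, ground_truth, "approving"))     # 2 of 3 hits
print(accuracy_ratio(suggestions, ground_truth, "disapproving"))  # 1.0
print(accuracy_ratio(suggestions, ground_truth))                  # 0.75
```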
Table IV presents the results of the matching accuracy evaluation, where we have evaluated three kinds of actions: approving, disapproving and overall (combination of approving and disapproving). Among the three individual matching tools COMA++ has achieved the best matching accuracy. However, the argumentation-based approach has outperformed all of them. It has achieved an increase of about 20 per cent compared to the average of the three matching tools, and an increase of about 14 per cent when compared with the best matching tool, COMA++. It is especially significant that when the performance of the three individual tools is quite poor for the disapproving action, the argumentation-based approach can still maintain relatively good accuracy.
Retrieval efficiency
Since ADLSearch is an information retrieval system, we used traditional information retrieval metrics - precision, recall and F-measure (Rijsbergen, 1979) - to evaluate the retrieval efficiency of the system. In addition, to highlight the advantages gained by schema matching algorithms when applied to retrieval from multiple DL, we also measured the performance of a baseline method based on the Z39.50 protocol, known as Base Z3950. This baseline simply uses the results retrieved through Z39.50 when processing keyword queries, without handling any inconsistencies that arise.
Table V compares the average performance of the individual tools and the argumentation-based approach on the data set. For each method we present the average precision, recall and F-measure. It is evident that Base Z3950 enjoys good recall, which is even better than that of the argumentation-based method. This can be explained by the fact that, when handling inconsistencies, the schema matching algorithms may miss true positive cases (i.e. some correct matchings are not approved and are therefore missing from the final results). However, Base Z3950 achieved very poor precision, meaning that many false positive cases are included in the final result because of the unresolved inconsistencies. As a result this baseline method is outperformed by all matching algorithms in terms of F-measure.
Among all the matching methods the argumentation-based approach is the clear winner in terms of precision, recall and F-measure. The argumentation-based approach achieved an increase of about 17 per cent in precision and 3 per cent in recall compared to the average of the other three matching tools. For a predefined weighting of precision and recall in the F-measure, the argumentation-based approach is also the best technique in the overall results. It achieved an increase of about 9 per cent in F-measure compared to the average of the three individual matching tools and 2 per cent compared with the best of them.
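The three retrieval metrics can be sketched as follows. This is a minimal example, assuming the standard weighted F-measure with weight beta (beta = 1 weights precision and recall equally); the weighting actually used in the experiments is not stated in the text, and the document identifiers and function name are hypothetical. The toy data mimics the pattern reported above: a baseline that returns everything relevant along with many false positives, and a matcher that misses one relevant record but returns far fewer wrong ones.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for one query.

    retrieved: iterable of document ids returned by the system
    relevant: iterable of document ids that are truly relevant
    beta: weight of recall relative to precision (an assumed parameter)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical query with four relevant records.
relevant_docs = {1, 2, 3, 4}
baseline_result = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}  # no inconsistency handling
matcher_result = {1, 2, 3, 11}                     # one miss, one false positive

print("baseline:", precision_recall_f(baseline_result, relevant_docs))
print("matcher: ", precision_recall_f(matcher_result, relevant_docs))
```

Here the baseline reaches perfect recall but low precision, so its F-measure falls below that of the matcher even though the matcher misses one relevant record.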
Conclusion
This paper introduces a system called Argumentation-based Digital Library Search, or ADLSearch: a search engine designed for scholarly information retrieval from multiple DL distributed over the internet. On the one hand, ADLSearch makes use of the standard Z39.50 protocol to connect with external DL for crawling and indexing scholarly information. On the other hand, ADLSearch includes an internal argumentation-based conflict reconciliation framework, which uses argumentation theory to handle inconsistencies when matching the multiple schemas of the external DL. The framework allows new DL to be indexed by ADLSearch automatically. Currently ADLSearch has indexed over 800 DL and has achieved good scalable performance due to its use of best practices for handling large-scale data sets at the server side.
Our research work has opened up some new research directions. First, we would like to design a negotiation protocol to enable negotiation within the ADLSearch system. Second, we intend to extend the notion of the proposed constraints to cover the integrity constraints that are relevant in practice (e.g. functional dependencies, domain-specific constraints, etc.). Third, we intend to apply our proposed approach to other problems. While our work focuses on schema matching between DL, our techniques - especially the argumentation-based conflict reconciliation framework - could be applied to other tasks such as entity resolution or business process matching.
Figure 1. The same publication record may be represented and stored differently in different digital libraries
Figure 2. Different terms for the same concept from different schemas and their mappings
Figure 3. Conflict reconciliation framework
Figure 4. Mappings between schemas by various algorithms
Figure 5. Defence graph of arguments given in Table II
Figure 6. Aggregated score of a mapping
Figure 7. Final resolution for a mapping conflict
Figure 8. The architecture of ADLSearch
Figure 9. A document record stored in the central schema of ADLSearch
Figure 10. Search interface of ADLSearch
Figure 11. A document descriptor retrieved by ADLSearch
Figure 12. Visualisation of mapping decisions and their corresponding scores
Figure 13. Tracking a defence graph in ADLSearch
Table I. Schema mapping table of the mappings depicted in Figure 4
Table II. Arguments generated from the schema mapping table given in Table I
Table III. Schema matching tools
Table IV. Schema matching accuracy evaluation
Table V. Retrieval efficiency evaluation
Equation 1 [Image omitted: See PDF]
Equation 2 [Image omitted: See PDF]
About the authors
Tho Thanh Quan is an Associate Professor in the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology (HCMUT), Vietnam. He received his BEng from HCMUT in 1998 and his PhD in 2006 from Nanyang Technological University. His current research interests include formal methods, program analysis/verification, the semantic web, machine learning/data mining and intelligent systems. Currently he heads the Department of Software Engineering at HCMUT and also serves as the Chair of the Computer Science Programme (undergraduate level). Tho Thanh Quan is the corresponding author and can be contacted at: [email protected]
Xuan H. Luong earned his BSc in Computer Science at HCMUT. He is currently a master's student in computer science at the Swiss Federal Institute of Technology (EPFL, Lausanne). His research background includes software verification, data integration and argumentation.
Thanh C. Nguyen is an Invited Lecturer at HCMUT, where he also obtained his PhD. His research interests include natural language processing, digital libraries and software engineering.
Hui Siu Cheung is an Associate Professor in the School of Computer Engineering at Nanyang Technological University. He received his BSc (1983) and DPhil (1987) from the University of Sussex. He worked at IBM China/Hong Kong as a System Engineer from 1987 to 1990. His current research interests include data mining, web mining, the semantic web, intelligent systems, information retrieval, intelligent tutoring systems, timetabling and scheduling.
This work was supported by research project B0212-20-02TD funded by Vietnam National University - Ho Chi Minh City.
References
Assaf, A., Louw, E., Senart, A., Follenfant, C., Troncy, R. and Trastour, D. (2012), "Improving schema matching with linked data", Computing Research Repository (CoRR), available at: http://arxiv.org/abs/1205.2691 (accessed 27 November 2014).
Aumueller, D., Do, H.H., Rahm, E. and Massmann, S. (2005), "Schema and ontology matching with COMA++", Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, pp. 906-908.
Belhajjame, K., Paton, N.W., Fernandes, A.A.A., Hedeler, C. and Embury, S.M. (2011), "User feedback as a first class citizen in information integration systems", Proceedings of the Conference on Innovative Data Systems Research, ACM, New York, NY, pp. 175-183.
Bentahar, J., Alam, R., Maamar, Z. and Narendra, N.C. (2010), "Using argumentation to model and deploy agent-based B2B applications", Knowledge-Based Systems, Vol. 23 No. 7, pp. 677-692.
Bernstein, P.A., Madhavan, J. and Rahm, E. (2011), "Generic schema matching, ten years later", Proceedings of the VLDB Endowment, Vol. 4 No. 11, pp. 695-701.
Besnard, P. and Hunter, A. (2008), Elements of Argumentation, MIT Press, Cambridge.
Bizer, C., Boncz, P., Brodie, M.L. and Erling, O. (2012), "The meaningful use of big data: four perspectives - four challenges", ACM SIGMOD Record, Vol. 40 No. 4, pp. 56-60.
Bloehdorn, S., Cimiano, P., Duke, A., Haase, P., Heizmann, J., Thurlow, I. and Völker, J. (2007), "Ontology-based question answering for digital libraries", Research and Advanced Technology for Digital Libraries - Lecture Notes in Computer Science, Proceedings of the 11th European Conference on Digital Libraries, Budapest, Hungary, 16-21 September, Vol. 4675, pp. 14-25, doi: 10.1007/978-3-540-74851-9_2.
Calimeri, F., Cozza, S., Ianni, G. and Nicola, N. (2008), "Computable functions in ASP: theory and implementation", in Garcia de la Banda, M. and Pontelli, E. (Eds), Logic Programming, Lecture Notes in Computer Science, Vol. 5366, Springer, Berlin, pp. 407-424.
Chang, C.L. and Lee, R.C.T. (1973), Symbolic Logic and Mechanical Theorem Proving, Academic Press, New York, NY.
Charwat, G., Wallner, J.P. and Woltran, S. (2012), "Utilizing ASP for generating and visualizing argumentation frameworks", Proceedings of the 5th Workshop on Answer Set Programming and Other Computing Paradigms, pp. 51-65, available at: www.dbai.tuwien.ac.at/research/project/argumentation/papers/CharwatWW12.pdf (accessed 27 November 2014).
Chen, C.C. and Wang, J.Z. (2002), "Large-scale emperor digital library and semantics-sensitive region-based retrieval", Proceedings of Digital Library - IT Opportunities and Challenges in the New Millennium, Beijing Library Press, Beijing, Vol. 8 No. 9, pp. 454-462.
Chen, H.H., Gou, L., Zhang, X. and Giles, C.L. (2011), "CollabSeer: a search engine for collaboration discovery", Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL '11), pp. 231-240, doi: 10.1145/1998076.1998121.
Cinque, L., Malizia, A. and Navigli, R. (2004), "OntoDoc: an ontology-based query system for digital libraries", Proceedings of the 17th International Conference on Pattern Recognition, Vol. 2, IEEE, Los Alamitos, CA, pp. 671-674.
Doan, A. and Halevy, A.Y. (2005), "Semantic-integration research in the database community", AI Magazine, Vol. 26 No. 1, pp. 83-94.
Doan, A., Franklin, M.J., Kossmann, D. and Kraska, T. (2011), "Crowdsourcing applications and platforms: a data management perspective", Proceedings of the VLDB Endowment, Vol. 4 No. 12, pp. 1508-1509.
Dong, X.L. and Srivastava, D. (2014), "Big data integration", Proceedings of 2014 IEEE 30th International Conference on Data Engineering, IEEE, Los Alamitos, CA, pp. 1245-1248.
Duchateau, F., Bellahsene, Z. and Roche, M. (2007), "BMatch: a semantically context-based tool enhanced by an indexing structure to accelerate schema matching", available at: www2.lirmm.fr/~mroche/Web/Publications/All_papers/duchateau_BDA07.pdf (accessed 27 November 2014).
Dung, P.M. (1995), "On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games", Artificial Intelligence, Vol. 77 No. 2, pp. 321-357.
Dung, P.M., Mancarella, P. and Toni, F. (2007), "Computing ideal sceptical argumentation", Artificial Intelligence, Vol. 171 No. 10, pp. 642-674.
Egly, U., Gaggl, S.A. and Woltran, S. (2010), "Answer-set programming encodings for argumentation frameworks", Argument and Computation, Vol. 1 No. 2, pp. 147-177.
Gonzalez, H., Halevy, A.Y., Jensen, C.S., Langen, A., Madhavan, J., Shapley, R., Shen, W. and Goldberg-Kidon, J. (2010), "Google fusion tables: web-centered data management and collaboration", in Elmagarmid, A. and Agrawal, D. (Eds), ACM's Special Interest Group on Management of Data (SIGMOD 2010), ACM, New York, NY, pp. 1061-1066.
Grosse, K., Chesñevar, C.I. and Maguitman, A.G. (2012), "An argument-based approach to mining opinions from Twitter", Proceedings of the First International Conference on Agreement Technologies, Vol. 918, pp. 408-422, available at: http://ceur-ws.org/Vol-918/111110408.pdf (accessed 27 November 2014).
Heras, S., de la Prieta, F., Rodríguez, S., Bajo, J., Botti, V.J. and Julián, V. (2012), "The role of argumentation on the future internet: reaching agreements on clouds", Proceedings of the First International Conference on Agreement Technologies, Vol. 918, pp. 393-407, available at: http://ceur-ws.org/Vol-918/111110393.pdf (accessed 27 November 2014).
Hutchinson, H.B., Rose, A., Bederson, B.B., Weeks, A.C. and Druin, A. (2005), "The international children's digital library: a case study in designing for a multilingual, multicultural, multigenerational audience", Information Technology and Libraries, Vol. 24 No. 1, pp. 4-12.
Jean-Mary, Y.R., Shironoshita, E.P. and Kabuka, M.R. (2009), "Ontology matching with semantic verification", Web Semantics, Vol. 7 No. 3, pp. 235-251.
Joint, N. (2010), "The one-stop shop search engine: a transformational library technology?: ANTAEUS", Library Review, Vol. 59 No. 4, pp. 240-248.
Kabak, Y. and Dogac, A. (2010), "A survey and analysis of electronic business document standards", ACM Computing Surveys, Vol. 42 No. 3, pp. 11-31.
Kent, R.E. and Bowman, C. (2011), "Digital libraries, conceptual knowledge systems, and the Nebula interface", Computing Research Repository (CoRR), available at: http://arxiv.org/abs/1109.1841 (accessed 27 November 2014).
Kruk, S.R. (2010), "Semantic digital libraries - improving usability of information discovery with semantic and social services", PhD thesis, National University of Ireland, Galway.
Marie, A. and Gal, A. (2007), "On the stable marriage of maximum weight royal couples", Proceedings of AAAI Workshop on Information Integration on the Web (II-Web07), AAAI Press, Menlo Park, CA, pp. 62-67.
Melnik, S., Garcia-Molina, H. and Rahm, E. (2002), "Similarity flooding: a versatile graph matching algorithm and its application to schema matching", paper presented at International Conference on Data Engineering (ICDE), San Jose, CA, 26 February-1 March.
Nguyen, Q.V.H., Luong, H.X., Miklos, Z., Quan, T.T. and Aberer, K. (2013), "Collaborative schema matching reconciliation", Proceedings of the 21st International Conference on Cooperative Information Systems (CoopIS 2013), Springer, Berlin, pp. 222-240.
Nguyen, T.T., Nguyen, Q.V.H. and Quan, T.T. (2012), "A framework to combine multiple matchers for pair-wise schema matching", Proceedings of 2012 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF 2012), IEEE Press, Los Alamitos, CA, pp. 1-6.
Papadakis, I., Kyprianos, K., Mavropodi, R. and Stefanidakis, M. (2009), "Subject-based information retrieval within digital libraries employing LCSHs", D-Lib Magazine, Vol. 15 Nos 9/10, doi: 10.1045/september2009-papadakis, available at: www.dlib.org/dlib/september09/papadakis/09papadakis.html (accessed 26 November 2014).
Peukert, E., Eberius, J. and Rahm, E. (2011), "AMC - a framework for modelling and comparing matching systems as matching processes", Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE), IEEE Press, Los Alamitos, CA, pp. 1304-1307.
Phan, M.D. (1995), "On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games", Artificial Intelligence, Vol. 77 No. 2, pp. 321-358.
Pinkel, C., Binnig, C., Kharlamov, E. and Haase, P. (2013), "IncMap: pay as you go matching of relational schemata to OWL ontologies", Proceedings of Ontology Matching 2013 (OM 2013), pp. 37-48, available at: http://ceur-ws.org/Vol-1111/om2013-proceedings.pdf#page=46 (accessed 27 November 2014).
Prakken, H. (2012), "Some reflections on two current trends in formal argumentation", in Artikis, A., Craven, R., Kesim Çiçekli, N., Sadighi, B. and Stathis, K. (Eds), Logic Programs, Norms and Action, Springer, Berlin, pp. 249-272.
Quan, T.T., Fong, A.C.M. and Hui, S.C. (2007), "A scholarly semantic web system for advanced search functions", Online Information Review, Vol. 31 No. 3, pp. 353-364.
Rahwan, I., Zablith, F. and Reed, C. (2007), "Towards large scale argumentation support on the semantic web", Proceedings of the 22nd National Conference on Artificial Intelligence, Vol. 2, AAAI Press, Menlo Park, CA, pp. 1446-1451.
Ravindranathan, U., Shen, R., Gonçalves, M.A., Fan, W., Fox, E.A. and Flanagan, J.W. (2004), "ETANA-DL: a digital library for integrated handling of heterogeneous archaeological data", Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, New York, NY, pp. 76-77.
Rijsbergen, C.J.V. (1979), Information Retrieval, 2nd ed., Butterworths, London.
Roitman, H. and Gal, A. (2006), "OntoBuilder: fully automatic extraction and consolidation of ontologies from web sources using sequence semantics", Proceedings of the 2006 International Conference on Current Trends in Database Technology, Springer, Berlin, pp. 573-576.
Sapena, O., Torreño, A. and Onaindia, E. (2011), "On the construction of joint plans through argumentation schemes", Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, Vol. 3, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, pp. 1195-1196.
Saracevic, T. and Dalbello, M. (2001), "A survey of digital library education", Proceedings of the American Society for Information Science and Technology, Vol. 38, pp. 209-223.
Shvaiko, P. and Euzenat, J. (2013), "Ontology matching: state of the art and future challenges", Knowledge and Data Engineering, Vol. 25 No. 1, pp. 158-176.
Song, H., Ma, F. and Wang, C. (2005), "Clustering-based schema matching of web data for constructing digital library", in Gervasi, O., Gavrilova, M.L., Kumar, V., Laganà, A., Lee, H.P., Mun, Y., Taniar, D. and Tan, C.J.K. (Eds), Computational Science and Its Applications - ICCSA 2005, Springer, Berlin, pp. 1086-1095.
Tho Thanh Quan, Department of Software Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Xuan H. Luong, Department of Computer Science, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Thanh C. Nguyen, Department of Computer Science, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Hui Siu Cheung, Department of Computer Engineering, Nanyang Technological University, Singapore
© Emerald Group Publishing Limited 2015
