Full Text

Turn on search term navigation

1. Introduction

The use of data for making public policies, detecting customer needs, or running scientific experiments is increasing. The datasets are exposed within the data marketers for trades or shared using open data portals. The data portals manage the data openly and share them for general purposes [1,2]. Data have been created by public institutions around the world. The shared data have many uses, but the general purpose of these portals is to help users to find the relevant data for their purpose, such as, data analytics, which can be used to increase profit. Data sharing at this scale has enabled the concept of open data platforms (CKAN (CKAN: http://ckan.org/, accessed on 15 June 2024), DKAN (DKAN: http://getdkan.org/, accessed on 15 June 2024), Socrata (Socrata: https://www.tylertech.com/products/socrata/, accessed on 15 June 2024), and so on) that manage data and support retrieval [3]. In an open data platform, data retrieval is limited to keyword retrieval services. The open data platform manages the datasets and provides search capabilities using the metadata of the dataset. However, related datasets can be searched with the help of different data analysis tools, such as text analysis, statistical analysis, predictive analysis, and prescriptive analysis. Such metadata can be expressed using Data Catalog (DCAT) or Resource Description Framework (RDF) vocabulary [4] (with W3C as a standard for developing linked open data). DCAT is a vocabulary of metadata formats that describes datasets in data catalogs. Conversely, RDF is a general framework to represent the interconnected data on the web from multiple sources. RDF data are a set of triples <subject, predicate, object>, where the statement of the predicate is in between the subject and object. Linked open data (LOD) is a combination of publically linked data, and the amount of data has increased recently. There have been studies on searching for a link between datasets [5,6]. Research has been performed to create or detect instances of relationships between datasets on data portals. For example, SILK [5] concatenates RDF data with different vocabularies. The connection policy designed by the user evaluates the similarity between the triples of two RDF datasets and creates a connection if the condition is met. Conversely, PARIS [6] uses a probability-based automatic link connection scheme. It also measures associations between classes and ontology and not just instance matching. However, this scheme has faced some difficulties in predicting while reflecting the hierarchical structure among datasets. Governments such as the United States and Australia have established national DataMaps to conveniently provide public institution-owned data to users by utilizing relationships between datasets. These relationships are expressed in the form of a graph that provides data exploration for users. However, current services solve the problem of datasets’ associations when they belong to the same taxonomy or have the same attribute name in a dataset schema. In addition, predicting concepts from the users’ submitted queries can be considered as a time-series forecasting problem, as with other problems, e.g., [7]. The state-of-the-art frameworks for searching relevant datasets lack the potential to scale up to large data portals containing billions of datasets, due to the space and time complexity [8,9,10,11,12,13]. Moreover, most of the existing systems do not provide a search based on the metadata of datasets by utilizing the DCAT catalog vocabulary [14]. Additionally, very few studies have been carried out to curate the dataset by considering the semantic relationship between the datasets. Therefore, the primary objective of this research is to develop a highly efficient storage and retrieval framework for recommending relevant datasets from large data portals in response to user queries. As data repositories continue to grow exponentially, the challenges of space utilization and search efficiency have become critical. Traditional methods are often inadequate in handling the immense computational and storage requirements, especially when dealing with vast matrices of dataset information. To address these limitations, we propose a novel approach that leverages matrix compression, indexing, and caching strategies to optimize both space and time during the search process. In this study, the relationship definition between datasets and concepts is constructed using a domain category graph (DCG) that organizes the data and metadata in a matrix format hierarchically. To reduce the turnaround time as well as provide the most relevant responses to the users, sophisticated prefetching models along with indexing and caching techniques are proposed in this scheme. Matrices for compression and materialization are devised to reduce the large space in the system. Finally, to analyze the effectiveness of our proposed system, we compare our proposed scheme with the existing [15,16] schemes by measuring the precision, recall, F1-score, and accuracy, as well as compression during the runtime of five queries. To achieve this, the following specific objectives are set:

Develop a Concept-Based Dataset Recommendation Algorithm: We aim to design an algorithm that can efficiently generate a Concept Matrix and a Dataset Matrix, enabling accurate and fast comparisons between the concept vectors of user queries and the registered datasets. The objective is to rank datasets in relevance to the user’s search query, enhancing both precision and recall in the search results.
Reduce Storage Requirements through Matrix Compression: One of the critical goals is to reduce the large storage requirements of dataset matrices by implementing effective matrix compression techniques. This will help in storing large datasets more efficiently, ensuring minimal degradation in search performance while achieving a significant reduction in space usage.
Improve Search Efficiency Using Indexing and Caching Mechanisms: To minimize the time required to process search queries, the proposed system will employ advanced indexing and caching strategies. The use of a bloom filter-based cache is intended to accelerate query response times by reducing redundant searches and utilizing fast memory access. The objective here is to cut down both computational time and storage space for frequent queries.
Incorporate Domain-Specific Ontologies for Semantic Search: To enhance the semantic understanding of user queries, the proposed approach will integrate domain-specific ontologies. This will allow the system to interpret the underlying meaning of the search terms more effectively, leading to improved retrieval of relevant datasets that match the user’s intent.
Compare and Benchmark Performance against Existing Solutions: The final objective is to comprehensively evaluate the performance of the proposed system against existing methods such as OTD and DOS. The comparison will focus on metrics like precision, recall, F1-score, accuracy, and runtime efficiency, with the goal of demonstrating significant improvements in both storage optimization and search efficiency.

The rest of the paper is organized as follows: In Section 2, we describe the related works. The methodology is depicted in Section 4, which includes the compression, indexing, and caching techniques that we use in our proposed system. We detail the experiments and results in Section 5; finally, conclusions are drawn in Section 6.

2. Related Works

In this section, we categorize the existing related work into three key areas: Data Relationship Prediction Techniques, which explore methodologies for identifying connections between datasets and enhancing data interoperability; Recommendation Systems, focusing on techniques for providing relevant content suggestions based on user preferences and historical interactions; and Open Data Classification Systems, which address frameworks and tools for managing and publishing open datasets to facilitate better access and utilization. Each subsection examines the current landscape of research, highlighting both the strengths and limitations of existing approaches.

2.1. Data Relationship Prediction Techniques

Recent research focuses on establishing connections between data. For providing more immediate responses, a cache strategy has been used [17]. An expiry factor based on the most frequently used (MFU) and least recently used (LRU) heuristics was devised to recognize stale data in cache. The Internet serves as a platform for data manipulation and knowledge sharing, with the Semantic Web enabling machines to understand and utilize this information. A detailed review was conducted by Wu, D. et al. [18] on extending RDF for temporal data, comparing methods for representation, querying, and storage, and exploring extensions for spatial, probabilistic, and other dimensions. However, while RDF-star extends RDF [19] to allow statements about statements, generating RDF-star data remains underexplored. Morph-KGCstar, an extension of the Morph-KGC engine, addresses this gap by generating RDF-star datasets, validated through tests and applied to real-world use cases, showing better scalability for large datasets but slower performance with multiple smaller files. The increasing volume of RDF data has led to the use of compression techniques to manage dataset sizes, and recent research has focused on adapting convolutional neural networks (CNNs) for RDF graph data. The proposed Multi Kernel Inductive RDF Graph Convolution Network (MKIR-GCN) [20] efficiently compresses RDF graphs by leveraging node similarities and structure, improving compression performance over existing methods.

Semistructured data systems take in RDF dump files or SPARQL endpoints as input. Specifically, LIMES [21] was designed to solve the Link Discovery challenge (i.e., the generic task of identifying relations between entities). If its scope is narrowed to the finding of sameAs relations, LIMES become appropriate for ER. This system utilizes a wide range of character- and token-based similarity measurements in addition to tailored blocking approaches. For efficient matching, semi-supervised combinations are learned from the similarity measurements. Unlike SERIMI [22], Duke [23], and KnoFuss [24], neither of these tools lacks a user-friendly graphical user interface. These systems primarily concern themselves with matching, applying effective but customized strategies based on similarity measurements, as opposed to blocking, where they only apply simple blocking techniques to literal values. MinoanER [25] and JedAI [26] are two hybrid tools that work equally well with both structured and semistructured information. This is feasible, since their procedures are schema independent. Indeed, they employ primary non-learning and schema-agnostic methods for blocking, matching, and clustering. Their block processing methods are likewise unique among operating systems. To better comprehend the development of RDF data, Pelgrin et al. [27] suggested a framework. A collection of metrics and an application to compute them throughout the whole dataset constituted the framework. Low-level change metrics and high-level change metrics are two groups of metrics that measure the differences between two versions of a given RDF structure or dataset. De Meester et al. [28] provided an alternate validation strategy based on rule-based reasoning that allowed for extensive modification of the inferencing procedures. They provided a theoretical foundation and a practical application and compared their work to previous methodologies. Their method, which supported the same number of constraint aspects as the current state of the art, provided a more detailed explanation of why violations occurred, with the help of the reasoner’s logical proof, and a more accurate count of violations, with the help of the modifiable inferencing rule set. However, Kettouch and Luca [29] introduced LinkD, a novel data interlinking technique that simply required a source dataset and output identities via external linkages to several linked data cloud sources. LinkD was built according to the specifications of the RDF file. To determine the degree of similarity between the resource names, several cutting-edge distance measuring methods and algorithms were used. The proposed method by Deepak and Santhanavijayan [30] leveraged the shortened relational RDF entities across a collection of webpages to generate query indicator words, which were then used to derive an RDF prioritizing vector. This method was novel because it used the best-fit occurrence estimate methodology to generate query facets from domain ontologies with query indicator words, allowing for dynamic query extension. In their work, Niazmand et al. [31] introduced the grouping-based summarization (GBS) and query-based summarization (QBS) methods for summarizing RDF graphs. The second technique was an improved lossless variant of the first. They conducted an evaluation of the efficacy of the proposed lossless graph data summarization to obtain full data by reconstructing an RDF Query Language using fewer triple patterns based on semantic similarity. In an RDF/SPARQL context, Ferrada et al. [32] offered algorithms that made it easier to quickly compute multidimensional similarity joins, where similarity in an RDF graph was measured by a set of attributes chosen in an SPARQL query, while research on similarity joins has been conducted in other settings, the difficulties inherent in RDF graphs require a fresh approach. Here, we analyze the feasibility of adding a similarity joining operator to the SPARQL language and explore possible strategies for implementing and optimizing such an operator. The increasing volume of RDF data has led to the use of compression techniques, with traditional compressors rarely exploiting graph patterns or structural regularities. Sultana et al. [33] proposed a hybrid TI-GI approach and RDF-RR to manage RDF datasets with named graphs, reduce structural redundancies, and achieve more compact serialization, significantly improving compression and indexing time compared to existing methods like HDT, HDT-FoQ, and 2Tp. This paper [34] surveys pre-trained language models for keyphrase prediction (PLM-KP), addressing the gap in jointly exploring keyphrase extraction and generation, introducing taxonomies for these tasks, and offering insights and future directions in NLP.

2.2. Recommendation Systems

The representative service among the services for the measurement of relevance is a content recommendation system. Content recommendation systems can be classified into content-based filtering and collaborative filtering systems. Content-based filtering is a technique for navigating and recommending items that are similar to the items consumed in the past, based on the user’s history, which is a technique for creating and recommending groups. In the context of the tourism industry, Lorenzi et al. [35] introduced an “assumption-based multiagent” system for making package recommendations based on user preferences. Depending on the needs of the user, it would either search for relevant information, filter it, or combine the results to generate a custom vacation package. For online courses, Salehi and Kmalabadi [36] suggested “modelling of materials in a multidimensional space of material’s attribute” as a way to generate recommendations. Content filtering and group work were used in this approach. Hence, for the purpose of creating posting suggestions in asynchronous discussion forums, Kardan and Ebrahimi [37] created a hybrid recommender system. Together, collaborative filtering and content-based filtering formed the basis of the system’s design and implementation. It took into account latent user data in order to compute user similarity with other groups. Conversely, collaborative filtering only focuses on the user’s archival predilection on a set of items. The main concept of this filtering process is that users who have agreed on some items in the past tend to agree on those in the future [38]. Göğebakan et al. [39] introduced a novel ontology-based assistive prescription recommendation system for patients with both Type-2 Diabetes Mellitus (T2DM) and Chronic Kidney Disease (CKD), combining drug dose prediction, drug–drug interaction warnings, and potassium-rise drug alerts, offering a first-of-its-kind comprehensive solution for clinicians. Oliveira and Oliveira [40] proposed an RDF-based graph for representing and searching specific parts of legal documents, allowing for more precise retrieval of legal information, supported by an ontological framework that captures the structure and relationships within legal systems. The approach yielded significant results when querying document parts related to specific terms.

Research on providing recommendation services through association measures has been conducted in various domains. Representative research is a movie recommendation technique [41] using movie metadata. This approach built an ontology using the metadata of a movie and proposed a movie recommendation method based on it. The similar relationship between movie genres was defined and provided based on the user’s favorite movie metadata. It made recommendations by generating similarity using movie genre; however, it was limited to a linear relationship, not a hierarchical structure. Another study provided related data retrieval using the hierarchical structure of data. Lee et al. [42] suggested a recommender system for music streaming that took into account the user’s past and present smartphone activities. Naive Bayes, Support Vector Machine, Multilayer Perception, Instance-based k-Nearest Neighbor, and Random Forest were a few of the machine learning techniques used in the proposed system. By combining the methods from the perspectives of ontology structure-oriented metrics and concept content-oriented metrics, Dong et al. [43] established a framework for a service-concept recommender system based on a semantic similarity model. When compared to other recommender systems, this one performed exceptionally well. To accurately forecast a user’s most likely online navigations, Mohanraj et al. [44] proposed the “Ontology-driven bee’s foraging approach (ODBFA)”. Using a scoring method and a similarity comparison, the self-adaptive system attempted to capture the varying needs of the online user. With the help of Semantic Web rule language [45], users can efficiently extract the dataset’s features and boost the model’s performance. Conversely, to evaluate the seriousness of a health problem, Torshizi et al. [46] created a hybrid recommender system based on hybrid fuzzy ontology. Benign prostatic hyperplasia patients may benefit from its suggested treatments. To better understand the structures of entity descriptors and linkages in an RDF dataset, Wang et al. [47] proposed to construct a pattern coverage sample that best depicted these patterns. Specifically, they used formulations of a set Steiner graph and set cover issues to produce concise pieces of programming. This flexible method may also represent query relevance for use in dataset searches.

2.3. Open Data Classification Systems

For government and private sector entities, there are several open data publication options from which to choose. One such free and open-source program is the Comprehensive Knowledge Archive Network (CKAN), which was created by the Open Knowledge Foundation. CKAN was created to be easily altered to suit individual needs. Python is used for the server-side logic of the CKAN web-based archive while PostgreSQL [48] is used for storing the dataset metadata. The Apache Solr-based search engine is also included with CKAN. Solr is a free and open-source enterprise search platform [49]. Solr enables searching through text, and it works with a variety of ranking algorithms, including BM25 [50] and TD-IDF. Further, a Learning To Rank (LTR) module can be added to Solr to enhance the ranking of the retrieved documents using machine learning models [51]. Central to the EU’s Digital Single Market (DSM) and open data agenda is the EU Open Data Portal. CKAN is just one of several open-source projects that have been used in the creation of the portal. Metadata catalogs allow users to search for information. In addition to CKAN, platforms such as Drupal (https://www.drupal.org, accessed on 30 June 2024), DKAN (https://getdkan.org/, accessed on 30 June 2024), and Socrata (https://open-source.socrata.com/, accessed on 30 June 2024) offer similar functionality. However, these options do not include built-in search engines and instead rely on third-party crawlers to index and find links to content. Recently, the importance of open data management has been increasing due to the increasing interest in open data and data sharing. To this end, an open data taxonomy [3] has emerged that organizes categories with similar or related attributes in a hierarchical structure based on industry or application scope. DCAT organizes vocabulary that can represent the metadata of various datasets, but since it is difficult to express all datasets, it is necessary to redefine DCAT-AP type metadata vocabulary for each domain. However, the open data taxonomy defines the metadata schema essential for the application on a per-category basis. The open data classification system does not need to redefine the metadata vocabulary of the dataset in the data portal and can be used to provide related data retrieval functions.

The effectiveness of existing data relationship prediction techniques, such as SILK and PARIS, can be enhanced by integrating a framework that utilizes ontology-based semantic search for ranking datasets based on keyword relevance. By leveraging metadata and ontologies to improve similarity measures, the herein proposed system will achieve higher accuracy in predicting relevant data connections. Thereby, this hypothesis builds on existing literature that emphasizes the importance of semantic relationships in enhancing data discovery and retrieval effectiveness.

3. Preliminaries

3.1. Domain Category Graph (DCG)

DCG is a taxonomy that represents and groups the datasets in a hierarchical structure based on their semantics and structural similarities, generated and verified by domain experts. The semantics and structural information are extracted from the domain knowledge presented in the metadata. DCG represents the concepts and it allows multiple inheritances. Each node contains a set of datasets having a narrow but strongly related concept to the parent node. A DCG can be defined as $D C G = (V, E)$ , where V represents the node, which is a dataset domain concept, and E is the relationship between the child and parent node in the taxonomy. All datasets must be included in at least one category of V. When there are two datasets, $D 1$ and $D 2$ , belonging to the domain category graph, the DCG relationship between $D 1$ and $D 2$ is formed if there is a vertical relationship in DCG, but the reverse does not hold. SKOS vocabulary [38], which is a W3C standard, is used to specify the relationship in a DCG. Figure 1 shows one example of a DCG.

3.2. Metadata

Metadata is the structured information of the data that explain the resources, for example, dates, title, description, and creators allied with that dataset, which permits anyone to efficiently discover, evaluate, and reuse that data in the future. For example, in a library catalog, books are categorized according to the author, title, and subject. However, in a phone book, phone numbers are categorized according to the assignee’s name. If the metadata of a dataset is m, and there are n attributes associated with that dataset, then the metadata of the dataset can be expressed by Equation (1).

(1) $D_{m} = \{P_{1}, P_{2}, \dots, P_{n}\} .$

Here, the attribute is represented by a pair consisting of a token and a concept, which can be represented as Equation (2),

(2) $P_{t} = \{\begin{matrix} P_{token}, P_{concept} \end{matrix}\} .$

In our system, we extract metadata from the Comprehensive Knowledge Archive Network (CKAN) data portal using the Data Catalog Vocabulary (DCAT) vocabulary. CKAN, a free and open-source data management system (DMS), powers platforms for data aggregation and sharing, making it easy to share and access data. Meanwhile, the DCAT is an RDF (Resource Description Framework) vocabulary designed to promote interoperability among data catalogs available on the web. Publishers can improve their catalogs’ discoverability and make it possible for apps to use information from different catalogs by describing datasets using DCAT. The metadata of each dataset contains the title and description of that dataset. Then, we use WordNet to remove the stop words and lemmatization, which extracted the tokens, $P_{t o k e n}$ , from the metadata. We also extract the concept, $P_{c o n c e p t}$ , from the metadata of the dataset, related to the DCG and DCAT vocabulary.

4. Materials and Methods

In our proposed system, the search process starts when the user types a query into the search field. The system then uses NLP to extract the terms from the user’s search query. If this user’s query was used previously, then our proposed system employs our proposed cache manager to represent the ranked list of the dataset. Otherwise, the system employs our introduced concept matrix, where each element represents the structural similarity between two concepts. Concepts from the ontology are compared with the extracted terms. Then, the system uses our proposed formula to generate the dataset matrix from the datasets retrieved from the data portals as well as concepts’ similarities matrices, which are then used to curate the related ranked datasets by comparing the concept vectors of all the registered datasets in response to the user query. We propose a matrix compression strategy to reduce the storage space needed in the system. Moreover, we use a materialized DM matrix, which contained the results of a query, to represent the ranked list of the datasets. In addition, we employed indexing for the search speedup. Furthermore, we use prefetch documents to optimize the execution time of the program. The system stores the ranked list of the dataset in the cache manager for further use. We can use data portals from different datasets such as transport, education, environment, location, and others. The methodology process diagram of our proposed system is shown in Figure 2. The details of each element are described in the following sections.

4.1. Process for Generating Concept Matrix

Ontology is used for exchanging data at both a common syntactic level and a shared semantic level. However, when matching two ontologies, an ontology matching operation must be performed, where two ontologies are provided as input, and the process returns a set of corresponding relations between their entities as output. In our system, we utilize structural similarity, which is based on extracting similarities using the features of the domain category graph. The structural similarity is extracted using the Wu and Palmer method, which calculates the relatedness by considering the depths of the two synsets and the depth of their Least Common Subsumer (LCS) [52]. The formula for calculating the similarity score based on Wu and Palmer is provided in Equation (3):

(3) $ConSim (s_{1}, s_{2}) = 2 \times \frac{depth (L C S (s_{1}, s_{2}))}{(depth ((s_{1}) + depth ((s_{2}))},$

where

L C S (s_{1}, s_{2})

is the lowest node in the hierarchy that is a hypernym of

(s_{1}, s_{2})

. From Figure 1, if we need to find the structural similarity between the two concepts “train” and “gasoline”, then the LCS of these concepts would be “traffic”, i.e., LCS (train,gasoline) = traffic. While the Wu and Palmer method provides a structured way to calculate similarity, it assumes equal importance for all sub-nodes under a common parent node. In reality, this is not always true, as certain sub-nodes may hold different levels of relevance based on contextual or domain-specific factors. For instance, in the context of transportation, “train” may have different implications and relevance compared to “gasoline”, which can influence the overall similarity score. This aspect highlights the potential limitations of using a purely depth-based approach for assessing similarity in more complex ontological structures, underscoring the need for our proposed system, which incorporates additional contextual and semantic factors to enhance the accuracy and relevance of similarity assessments.

The similarity score based on the Wu and Palmer method is never zero, as the depth of the root in a taxonomy is always one, and the score reaches one if the two concepts being compared are identical. This makes calculating similarity scores straightforward using the Wu and Palmer method. In our system, this similarity score is used to generate the Concept Matrix (CM), which is an m-by-m matrix where each element represents the structural similarity between two concepts. The more common information two concepts share (based on the information content of the concepts that subsume them in the taxonomy), the more similar they are compared to concepts that are unrelated. The CM, constructed using the Wu and Palmer structural similarity score, is illustrated in Figure 3.

The algorithm used to calculate the path length (Algorithm 1) for generating the Concept Matrix (CM) leverages the hierarchical structure of the ontology to compute the shortest path between two concepts. This path length is a crucial factor in constructing the CM, which captures the structural similarity between concepts based on their positions within the ontology. The process begins by accepting two inputs: the two concepts ( $C_{i}$ and $C_{j}$ ) whose path length needs to be determined, and the ontology graph, which represents the hierarchical relationships between various concepts. The algorithm traverses the graph to find the shortest path between these two concepts. The path length is defined as the least number of edges between the two concepts, and it is mathematically represented by the following equation:

(4) $path_length = ∣ least # of Edges between C_{i} and C_{j} ∣ .$

This equation essentially counts the number of edges that need to be traversed in the ontology graph to connect the two concepts, thereby providing a measure of their structural proximity.

The algorithm begins by initializing a global counter set to 1, which will be incremented as the path between the concepts is traversed. In the initial check, the algorithm determines whether the two concepts ( $C_{1}$ and $C_{2}$ ) are identical. If the concepts are the same, the algorithm immediately returns the counter value as 1, indicating that no traversal is required. However, if the concepts differ, the algorithm proceeds to query the ontology structure for the broader concepts associated with each of the input concepts. This is done using the Simple Knowledge Organization System (SKOS) vocabulary, which helps extract broader or more general concepts that reside higher up in the hierarchy. The algorithm then checks if the upper hierarchical concepts of the two input concepts match. If the broader concepts match, this indicates that both input concepts share a common parent in the hierarchy, and the algorithm returns the current value of the counter, incremented by 2, to account for the traversal to the common parent. However, if the broader concepts do not match, the algorithm continues to search by recursively calling itself with the next set of broader concepts in the hierarchy. This recursive process continues until a match is found, with the counter being incremented at each level of the hierarchy traversal.

One of the key features of the algorithm is its handling of cases where one concept is contained within the broader list of the other concept. In such cases, the index of the contained concept is determined, and the counter is adjusted based on the relative position of the concepts in the broader list. This ensures that the algorithm accounts for the hierarchical distance between concepts accurately, even when one concept is a sub-concept of the other. The counter is updated accordingly and returned as the final path length. Moreover, the algorithm includes specific checks to handle edge cases where certain predefined entities (such as rdflib.term.URIRef(‘:ontologies#Entity’)) are encountered. These entities represent root or top-level nodes in the ontology graph, and when encountered, the algorithm modifies the counter appropriately to reflect the distance to these entities.

Algorithm 1: Path length calculation algorithm to generate CM

Therefore, the algorithm employs a recursive approach to navigate the ontology graph and calculate the shortest path between two concepts. The use of broader concepts and hierarchical relationships allows it to traverse the graph efficiently, incrementing the counter as it moves up and down the hierarchy. The final path length returned by the algorithm is a direct measure of the structural proximity between the two concepts, and this value is used to populate the corresponding entry in the Concept Matrix (CM).

The process of calculating structural similarities between ontologies involves comparing how concepts are arranged within a hierarchical structure, such as a taxonomy. To simplify, the Wu–Palmer method calculates the similarity between two concepts by considering the “distance” between them in this hierarchy. Specifically, it looks at how deep each concept is in the hierarchy (how specific they are) and finds their lowest common ancestor (a broader concept they both share). The closer the common ancestor is to both concepts, the more similar they are considered to be. For example, if we want to compare two concepts like “train” and “gasoline”, the Wu–Palmer method identifies their lowest shared category—such as “traffic”—and calculates their similarity based on how deep “train” and “gasoline” are in the taxonomy compared to “traffic”. The algorithm assigns a higher similarity score when concepts share a closer common ancestor, as they are more related within that domain. In our proposed system, this similarity score forms the basis of the Concept Matrix (CM), which represents the relationships between all pairs of concepts. Each element in the matrix shows how structurally similar two concepts are, with higher values indicating stronger similarities. This approach helps us model how different parts of an ontology are related, which is crucial for applications that need to match or compare ontologies in a meaningful way.

The originality of the proposed algorithm for measuring distances between concepts and terms lies in its hybrid approach that combines structural similarity with contextual and semantic factors, addressing the limitations of existing algorithms that predominantly rely on purely depth-based metrics. Unlike traditional methods such as Wu and Palmer, which assume equal importance among sub-nodes under a common parent, our algorithm recognizes that the relevance of concepts can vary significantly based on contextual nuances and domain-specific considerations. By incorporating a path-length calculation that accounts for hierarchical relationships and contextual information from the ontology, our algorithm enhances the accuracy and relevance of similarity assessments. This innovative integration allows for a more nuanced understanding of concept relationships, providing a robust framework that adapts to the complexities of real-world ontological structures, ultimately leading to improved matching operations in applications requiring semantic interoperability. The flow chart of generating the concept matrix is given in Figure 4.

4.2. Process for Generating Dataset Matrix

In order to determine the similarity between the dataset and the concept, the dataset similarity should be measured. The dataset similarity computation is based on the Open Topic Detection (otd) system, which uses TF-IDF (Figure 5). The TF-IDF (term frequency-inverse document frequency) is a statistical measure to identify the importance of a word in a document or corpus. If the appearance of a word in a document is increased, the importance of that word will also increase proportionally, but it is offset by the frequency of the word in the corpus [53].

Therefore, each word’s TF-IDF weight can be computed by using two terms: the normalized term frequency (the number of times a word appears in a document divided by the total number of words in a document) and the inverse document frequency (the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears). The equation can be written as Equation (5),

(5) $w_{i, j} = t f_{i, j} x log \frac{N}{d f_{i}},$

where

t f_{i, j}

represents the number of occurrences of i in j,

d f_{i}

defines the number of documents containing i, and N denotes the total number of documents.

The TF-IDF matrix between the concept and the dataset of our system is shown in Figure 6. From the figure, it can be seen that the similarity between the concept, city, and Dataset 1 was 0.8 based on the TF-IDF process. The remaining values in Figure 6 were calculated using a similar process.

Then, the Dataset Matrix is generated using the CM and the TF-IDF matrix. The algorithm for generating the DM is given in Algorithm 2, and the matrix is represented in Figure 7. The DM also follows the Wu and Palmer method [52]. The dataset descriptions are linked to the concepts in the ontology either manually or automatically. In the case of the datasets, first, we identify the title and the description of the datasets from the CKAN data portal. Then, we extract the keywords by preprocessing, for example, deleting the stop words from the description of the dataset, and compare these with the CM values, based on the Wu and Palmer approach. Numerous links can be used between the concepts and the datasets. However, there might be some links from different concepts to the same dataset or from a single concept to multiple datasets. Sometimes, contradictory similarity scores may arise. If such a situation arises, we consider the highest similarity score. The DM of our ontology is given in Figure 7. Here, D1, D2, ..., Dn represents the dataset and city, traffic, .., LPG depicts the concepts. Therefore, based on the Wu and Palmer method the similarity between D3 and the facility is SIM (D3,facility) = 1.000.

Algorithm 2: DM algorithm

In Algorithm 2, the variable alpha represents a weighting factor used to balance the contribution of the similarity score obtained from the Concept Matrix (CM) and the existing dataset similarity scores. Specifically, alpha allows the algorithm to adjust the influence of the structural similarity derived from the CM in relation to the dataset similarity scores. By tuning the value of alpha, researchers can prioritize the importance of structural relationships between concepts and datasets, which can enhance the overall relevance and accuracy of the resulting Dataset Matrix (DM). This flexibility helps in refining the recommendations based on specific use cases or datasets, ensuring that the most pertinent information is highlighted in the similarity calculations.

The algorithm presented in Algorithm 2 is designed to generate a Dataset Matrix (DM_matrix) by calculating similarity scores between datasets and concepts. It takes two inputs, a dataset graph (which contains information about datasets) and an ontology graph (which contains the hierarchical structure of concepts). The algorithm is structured in several phases to iteratively compute and store these similarity scores. First, it initializes a similarity graph (sgraph) to represent the relationships between datasets and concepts. For each dataset identified in the graph, an “ambit_matrix” is initialized with a null value, preparing it for later score assignments. Next, the algorithm retrieves similarity relationships between datasets and concepts from the graph. For each similarity entry, it extracts the dataset, concept, and corresponding score. This score represents a predefined similarity measure between a dataset and a concept from the ontology. In the next phase, the algorithm iterates through each column in the similarity graph, checking if the column exists in the Concept Matrix (CM), which stores the relationship between different concepts. If a match is found, the similarity score is adjusted by multiplying it with the existing value in the CM for that concept. The algorithm then checks if the concept-specific score in the distance dataset (dds) is zero. Based on this check, it either calculates the score using a weighted alpha factor for the similarity score or a combination of the inverse of the distance dataset score (weighted by beta) and the alpha-weighted similarity score. Finally, the calculated score is used to update the DM_matrix by storing the maximum score between the newly calculated value and any pre-existing score for the given dataset and concept. This ensures that the highest score is retained in the matrix. At the end of the process, the algorithm returns the final DM_matrix, which contains the maximum similarity scores for each dataset-concept pair. The flow chart of generating the dataset matrix is given in Figure 8.

Materialized Dataset Matrix

In the data portals, the number of datasets is in the millions [4]. Therefore, the size of a DM is a bottleneck for dataset recommendation systems. We propose a materialized view for the DM (calculating the value of the DM on the runtime). Therefore, we can reduce the space complexity with the overhead of computational cost. To reduce the computational cost, we propose to maintain the index and cache strategies (explained in the next sections), which help to calculate and store only those values that are relevant to the search results of queries.

4.3. Compression

An index is maintained for computing and storing the DM matrix. The concepts, which are mostly related, are at the same level of the DCG graph hierarchy and are inherited from the same parent. These concepts are considered to be siblings. Therefore, a level-by-level order (starting from root to leaf node) is maintained and common parents are stored in the index file. We store the concepts in the CM and DM in the same order. The order of the concepts from Figure 7 is represented as Equation (6):

(6) $C (order) = \{C_{1} (city), C_{2} (traffic), C_{3} (facility), \dots, C_{N} ({leaf}_{node})\} .$

As discussed in the previous section, the DM contains vectors of the dataset similarity values against each concept. In the index, we stored the vector corresponding to each parent node starting from the root node level of ontology to a certain level depending on our threshold value.

The threshold value is proportional to the size of the RAM and the vector size that we used. When a query (in the form of a vector) arrives, the cosine similarity value is computed starting from the root node to the corresponding level nodes in the index. After finding the most similar vector, we start looking for the next concept vector in the DM if that was not already in the index. Therefore, by applying this indexing technique, we reduce the computational cost of finding related datasets and compress those datasets. The concepts in the similarity graph and the CM are stored according to the level order traversal of the DCG graph. The concepts in the top order contain more values in the similarity graph because of the abstract nature of the concepts. Moreover, in the similarity graph, we do not store values that had a low similarity. Therefore, the leaf node concepts are the last columns of the similarity graph and contain more empty values. For compression, at first, we hash the column and row IDs (dataset and concept IDs) of the TF-IDF similarity graph matrix in the form of integer values. Then, we group the rows and columns. Each group contains three rows and three columns. We store the values of those groups in a file. We do not store the empty groups. We modify the compression technique proposed in [17] for our system. Algorithm 3 contains the pseudocode of the overall process of our proposed compression scheme.

Algorithm 3: Compression algorithm

Input:
numOfDsInGrp ←3, numOfConceptInGrp ← 3, totalConcept ←
100, numOfColGrp ←
totalConcepts/numOfConceptInGrp, numOf RowGrp ←
totalDatasets/numOfDsInGrp, InitializeRowBlocks(), currentRowGrp ←
0, numbOf Block ← 0, indexTable ← array()
Output:
Compressed graph file

1:. while read a line from TFIDF-Graph do
2:. $d s I d \leftarrow l i n e [d a t a s e t]$
3:. $c p t I d \leftarrow l i n e [c o n c e p t]$
4:. $s i m V a l \leftarrow l i n e [s i m i l a r i t y]$
5:. $r o w G r p I d \leftarrow d s I d / n u m O f D s I n G r p$
6:. if $r o w G r p I d \neq c u r r e n t R o w G r p$ then
7:. for $i = 0$ to $n u m O f C o l G r p$ do
8:. if $i s E m p t y B l o c k [c u r r e n t R o w G r p]$ then
9:. $s t o r e R o w B l o c k ()$
10:. $b l o c k I d \leftarrow c u r r e n t R o w G r p * (n u m O f C o l G r p + i)$
11:. $i d e x T a b l e [n o O f B l o c k s] \leftarrow b I D n o O f B l o c k s + +$
12:. end
13:. end
14:. $I n i t i a l i z e R o w B l o c k s ()$
15:. $c u r r e n t R o w G r p \leftarrow r o w G r p I d$
16:. end
17:. $c o l V a l \leftarrow ((d s I d % n u m O f D s I n G r p) / n u m O f C o n c e p t I n G r p) + (c p t I d % n u m O f C o n c e p t I n G r p)$
18:. $b l o c k s [c p t I d] / n u m O f C o n c e p t I n G r p] [c o V a l] \leftarrow s i m V a l$
19:. $i s E m p t y B l o c k [r o w G r p I d] \leftarrow t r u e$
20:. end
21:. $I n i t i a l i z e R o w B l o c k s ()$
22:. $b l o c k s [n u m O f C o n c e p t I n G r p] [n u m O f C o l G r p * n u m O f R o w G r p] \leftarrow 0$
23:. end =0

It is designed to compress a similarity graph by grouping datasets and concepts into blocks and storing the data more efficiently. The algorithm takes several input parameters, including the number of datasets and concepts per group, the total number of concepts, and the total number of dataset groups. These parameters guide the partitioning of the data into manageable row and column blocks. The algorithm starts by reading the TF-IDF graph line by line, where each line contains a dataset ID, a concept ID, and a similarity value. Based on the dataset ID, the algorithm calculates the current row group ( $r o w G r p I d$ ), which helps in determining when to start a new block of data. If the row group changes, the algorithm checks if the current block is empty and stores the completed row block. It then initializes the next block for the new row group. Once the row group is set, the algorithm calculates the column value ( $c o l V a l$ ) based on the dataset and concept IDs. The similarity value is then assigned to the corresponding block in the matrix. The algorithm also updates the status of the block to indicate whether it contains data or is empty. After processing all the lines from the TF-IDF graph, the algorithm finalizes the compression by initializing the remaining row blocks and storing the blocks that were processed. The result is a compressed graph file where the data is grouped into blocks, allowing for more efficient storage and retrieval. The flow chart of the compression process is given in Figure 9.

4.4. Indexing

The approximate nearest neighbor (ANN) has been extensively used for search speedup by preprocessing the data into an efficient index. There are various variants of ANN that have been proposed for indexing a database. We use the Fast Approximate Nearest Neighbor (FLANN) [54], which is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. We tested multiple algorithms, including randomized kdtree [55], hierarchical k-means, and automatic selection of the optimal algorithm for indexing from the FLANN library [54] to index the DM matrix. Among them, the hierarchical k-means gave the best results, with a branch size equal to 16.

4.5. Prefetch Documents

The prefetch documents are used to improve the efficiency of the executable by storing data about the files that will be loaded before the program is run. In response to the submitted query, a set of concepts against the query is computed in our proposed scheme. These concepts can be utilized as historical knowledge to predict future concepts. A log of concepts against user-submitted queries is generated, which is then utilized in the prediction model. Long Short-Term Memory Networks (LSTMs) are implemented to predict future concepts. The time-ahead concept prediction based on the history knowledge is formalized as Equation (7).

(7) $y (C n + 1), y (C n + 2), y (C n + 3) = p (y (C n)),$

where p represents the predictor,

y (C

) depicts the nth concept in history, and

y (C

) denotes the time-ahead concepts. These concepts were associated with a column vector in a DM matrix, which represents the similarity of the concepts against each dataset.

4.6. Caching

Computational performance is an important parameter for searching related datasets from large data portals, especially in real-time scenarios. It is an expensive approach to search for similar datasets directly from the matrix space (DM). Caching is an approach through which the whole dataset does not need to be searched, and it does not require many roundtrips between the client and the server. By using caching, the network delay and the cost of accessing the data are reduced. This is why the capability to reuse the previously fetched resources is a remarkable solution to optimize the performance of our system. In our proposed system, the memory caching is modeled in CPU caches. For caching the datasets, doubly linked lists or singly linked lists can be used. However, a doubly linked list is better, as it can traverse in both forward and backward directions, while the insertion and deletion of a node can be performed in $O (1)$ time if the node is directly accessed, the lookup to find the node that needs to be deleted requires $O (N)$ time in the worst-case.

Moreover, if the hash table and linked list are compared, the hash table has much better lookups than the linked list, which is O (1). Thus, in our proposed system, we use a hash table and doubly linked list for better searching in the cache. For cache eviction, we needed to determine which algorithm should be used if the storage was full, as the memory storage was limited in size. For this purpose, we analyzed the Least Frequently Used (LFU) algorithm [56], which selects and removes elements that have the least number of hits, and the Least Recently Used (LRU) algorithm [57], which selects and removes elements that have not been used recently. From the analysis, we found that the LRU was the most suitable for our system, as it evicted the element at approximately 99% that had the lowest quartile of use in time. The performance of our proposed system by using a cache was measured using Equation (8).

(8) $\begin{matrix} Average Memory Access Time = Hit rate + (Miss rate \times Miss penalty) . \end{matrix}$

A bloom filter [58] was also used before the caching system to improve the performance (Figure 10) of our proposed scheme. Because it is an array of integers, which does not store the actual items, it had no false negative probability and low false positive probability. It also saved expensive data scanning across several data servers, as it rapidly and memory-efficiently determined whether an element was present in the cache or not.

4.7. Semantic Search

A semantic search is a data-searching scheme that is not only used to decipher the words typed into the search box but also to understand the full meaning of the user’s inquiry. The semantics of the data depend on the metadata that represents the data. To effectively search through datasets, a semantic search engine must grasp not just the user’s intention (via the query) but also the data’s context (datasets). Since ontologies give clear formal descriptions of the concepts and relations in a domain, they play an important role in enhancing the precision of searches, as they allow for the resolution of ambiguity associated with keywords through the use of precise concepts. In this study, We propose a scheme that exploits the domain knowledge based on ontologies. During the dataset search using our proposed scheme, the system transforms the user’s query by using Natural Language Processing (NLP) schemes to the concepts from a well-defined standard ontology. The proposed system then converts the query into a vector, checks the index, calculates the corresponding dataset vectors in the DM matrix, and finds the semantic similarity, respectively. The output is the ranked list of datasets, based on the semantic similarity of the dataset vector and the DM vectors calculated on the runtime. The ranked result is displayed to the user after sorting and filtering. Therefore, applying ontology-based semantic search in our proposed scheme can improve the quality and efficiency of the dataset search.

5. Performance Evaluation

In this section, we will provide a brief overview of the datasets and experimental setups used to analyze our proposed scheme. Moreover, we have also included a comparative analysis of our proposed scheme in terms of the existing scheme.

5.1. Datasets and Experimental Setups

We harvested 600 datasets from a Google data search [4] and datahub (DataHub: http://old.datahub.io/, accessed on 2 July 2024) using “Transport” and related keywords. The extracted dataset stored the metadata of the dataset in a data portal, which was built on the CKAN platform, and used as training data for this experiment. In the experiment, a PC with an Intel Core i7 with 16GB RAM was used as the server for data management and index generation of the proposed algorithm. The NLP python library (NLTK) (NLTK: http://nltk.org/, accessed on 2 July 2024) and Wordnet [59] were used for natural language processing on the dataset metadata. The set of query datasets that we used for the experiments is given in Table 1.

5.2. Experimental Details

In this section, we will discuss the experimental details of our proposed system. The first step of our experiment was to preprocess the metadata of the user’s query datasets, which contained the title and description (dct:title, dct:description). These metadata were harvested from the CKAN server using the API call. The harvested data were then saved in RDF format in the form of a TF-IDF graph, which was used to generate the matrices. The second step was to calculate the similarity between all ontology concepts and save these in the form of a CM. As we already explained in Section 4, we stored all the matrices and similarity graphs in the level-wise traversal order of the DCG concepts. For our experiments, we added 200 concepts and 600 datasets to our DCG; therefore, the size of our similarity graph was 200 × 600 without compression. The size of storing this matrix on the RDF graph was 5 MB. Furthermore, after applying the proposed compression approach (discussed in Section 4), we reduced the size of the TF-IDF similarity graph to just 125 KB. Moreover, the time to access a block was also faster than most of the compression algorithms (i.e., 2 ms for one block row access in the worst-case scenario).

Another optimization was to create the index structure to calculate the similarity between the query vector and the DM. The search time for the matching dataset was reduced by one-tenth by applying the indexing explained in Section 4. However, the purpose of the indexing was to fetch the relevant dataset vector from the similarity matrix and to compute the DM for the fetch documents’ vectors at runtime. Thus, we did not store the whole DZ (materialized DM). Instead, we generated the index from the DM when the query occurred. We obtained only the relevant datasets from the index structure and prefetching algorithm and then uncompressed the relevant TF-IDF matrix values using a row-wise decompression algorithm. We computed the corresponding DM vectors using the decompressed rows and the CM vectors. Finally, we sorted the resulting vectors and obtained the relevant datasets. After these optimizations, we generated the results, which provided better accuracy, recall, and F1-score compared to [15] (Figure 11). The true values were generated by assigning the concepts to datasets manually by domain experts. For each dataset in the training set, three domain experts were assigned to semantically correct the concepts for each dataset. Moreover, the confusion matrix was generated by making a list of all the datasets related to each concept in the TF-IDF similarity graph. However, the search threshold (Ts) values were used to limit the set of datasets in the list. In our system, the true values were given in the form of ontology concepts’ tags to the training data. We observed that the tags were not evenly distributed; therefore, we also measured the F1-score, which considers both false positives and false negatives. Figure 11 shows all these matrices in the form of a bar chart. The statistical measures, such as the average of all queries’ results on different threshold values, are shown in the bar chart. We compared our system with [15,16]. However, Hagelien [15] provide only the keyword search and Jiang et al. [16] use hybrid indexing during searching. In our approach, we harvested the dct:title and dct:description from the input dataset, which is the URL of that dataset. Moreover, we employed our proposed compressing, indexing, caching, and semantic search algorithm to improve the performance of our proposed scheme. The results using different values of concept and search thresholds are given in Figure 11. The figure shows a clear improvement in the results by the given statistical measures.

5.3. Comparative Analysis

To analyze the efficiency of our proposed scheme, we used precision, recall, F1-score, and accuracy. The precision quantifies the number of positive class predictions that belonged to the positive class. Conversely, recall quantifies the number of positive class predictions made out of all the positive examples in the dataset. The F-measure provides a single score that balances both the concerns of precision and recall in one number. Furthermore, accuracy is the degree of closeness between a measurement and its true value. During the experimental analysis, we employed two types of threshold values named concept and search threshold. The concept threshold limits the similar concepts to be matched and the search threshold limits the relatedness of the search results datasets. Figure 11 represents the bar graphs for the OTD [15], DOS [16], and our proposed algorithm. We used different concepts ( $T c$ ) and search ( $T s$ ) threshold values, 0.7–0.75, 0.7–0.8, 0.75–0.775, 0.75–0.8, 0.8–0.775, and 0.8–0.8, against each query represented in Table 1. From Figure 11, we can observe that the precision, recall, F1-score, and accuracy were better for our proposed scheme. For example, the precision of our proposed scheme is approximately 69.60% and 15.26% better than the OTD and DOS schemes, respectively. Furthermore, the accuracy of our proposed scheme is approximately 30.09% and 15.914% better than the OTD and DOS schemes. This is because of the use of our proposed optimized DM, compressing, indexing, and semantic search algorithm. Therefore, our proposed scheme can increase efficiency and search quality while finding similar datasets during the search. Meanwhile, when the threshold value of $T s$ and $T c$ were 0.7–0.75, 0.75–0.775, and 0.8–0.775, respectively, then the accuracy was better for the OTD framework because the datasets that were semantically related were part of the results when the threshold value was lower. In addition to that the recall value was better than the OTD and DOS schemes when the threshold values of $T s$ and $T c$ were 0.7–0.75, 0.75–0.775, and 0.8–0.775, respectively. Moreover, the recall values of the OTD framework were less than 0.5, which showed that most of their values were not correct, and our approach gave more accurate results. Furthermore, the F1 score is a better predictor of results when there is uneven classification, as in our case. The F1-score was also better for our approach than the existing OTD and DOS schemes. Therefore, from the above analysis, we can conclude that our proposed scheme is much better than the existing OTD and DOS schemes.

To analyze the efficiency of our proposed scheme, we also calculated the average, standard deviation, variance, min, and max values of all the queries for all the metrics (precision, recall, F1-score, and accuracy), which is shown in Figure 12. The average or mean of a set of numbers is the value obtained by dividing the total number of values in the set by the sum of those numbers. A standard deviation is a measure of how distributed the data are in proportion to the average. The variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Finally, the min and max values are the lowest and the highest values. After analyzing Figure 12, we can observe that our proposed scheme has approximately 71% precision, 90% recall, 80% F1-score, and 70% accuracy. This is because it employs better compression, indexing, caching and prefetching strategies. For caching, we used a Level 1 (L1) graded cache (size 8∼64 KB), as it was part of the CPU chipset and faster to access. To determine whether the data were present in the cache or not, a bloom filter was added, which made the caching faster and used less storage space. The time needed by the bloom filter for either adding new elements or checking the existing elements was O(k). Therefore, from the above analysis, we can conclude that our proposed scheme is better than the existing schemes in all the statistical matrices.

Additionally, we compared the runtime on different queries of our proposed approach with the existing OTD and DOS schemes to analyze the effectiveness of our proposed scheme. Figure 13 shows that the proposed approach was faster than the OTD and DOS schemes due to its limited computation in the DM matrix. Our proposed approach computed and compared only those rows, which had the highest similarity, with the query vectors instead of computing and sorting the whole matrix. One limitation of this approach was that we limited the search results to k elements, which were difficult to change on runtime. During the analysis of the runtime of the proposed scheme, we calculate the total time from the user’s search query to the searched results represented as a ranked list of the datasets. Therefore, it includes the matrix rows’ decompression time, finding rows to decompress using the index, and the DM rows’ computation time. The average time breakout of our proposed scheme for the five queries is represented in Table 2.

Based on the insights gleaned from [60,61,62], we can identify several future proposals and practical applications that can be drawn from our research results:

Future Proposals
- −. Enhanced Dataset Documentation Standards:
  Building upon the findings regarding the values in dataset documentation from the computer vision study, future work could focus on developing standardized guidelines for documenting datasets across various domains. These guidelines could emphasize the importance of context, positionality, and ethical considerations in data collection, aligning with the call for integrating “silenced values”. This would not only enhance transparency but also improve the quality of datasets used in machine learning models.
- −. Ontology-Driven DBMS Selection:
  The development of an OWL 2 ontology for DBMSs opens opportunities for automated decision-making tools that leverage this ontology to recommend the most suitable database systems for specific use cases. By integrating our results with existing knowledge from DBMS literature, we can develop a user-friendly interface that allows practitioners to input their requirements and receive tailored DBMS suggestions, thus optimizing database management tasks in various applications.
- −. Local Embeddings for Custom Data Integration:
  Our research on local embeddings could be further explored in conjunction with the insights from deep learning applications in data integration. Future work could develop a hybrid framework that combines our graph-based representations with pre-trained embeddings to enhance integration tasks across diverse datasets. This approach would allow organizations to efficiently merge enterprise data while preserving the unique vocabulary and context of their datasets.
Practical Applications
- −. Improving Computer Vision Applications:
  The insights gained regarding dataset documentation in computer vision could lead to improved practices in creating datasets for applications such as facial recognition, object detection, and autonomous driving. By prioritizing contextuality and care in dataset creation, developers can create more robust models that perform well across varied real-world scenarios, thereby increasing trust and reliability in AI systems.
- −. Semantic Web Integration:
  The ontology developed for DBMSs can serve as a foundation for integrating semantic web technologies into database management practices. Organizations could utilize this ontology to create a semantic layer on top of their existing databases, enhancing data interoperability and allowing for richer queries and analytics that span multiple data sources.
- −. Advanced Data Integration Frameworks:
  The proposed algorithms for local embeddings can be applied to various data integration tasks beyond schema matching and entity resolution. For instance, they can facilitate the integration of heterogeneous data sources in healthcare, finance, and logistics by ensuring that the contextual relationships within data are maintained. This would support better decision-making and operational efficiencies across industries.
Connecting to Previous Literature By integrating the findings from the provided literature, we can see how our proposals align with and build upon existing research. For instance, the emphasis on dataset documentation in the computer vision paper resonates with our call for improved standards, while the ontology’s design in the DBMS study complements our vision of enhancing database selection processes. Furthermore, leveraging local embeddings in conjunction with existing deep learning techniques ties back to the current trends in machine learning and data integration, showcasing a cohesive evolution of ideas in the field.

The technical terms and their abbreviations used in this study are given in Table 3.

6. Conclusions

The need for a large space and a significant amount of time are the major challenges for finding relevant datasets from data portals. In this study, we proposed an efficient storage framework for recommending the relevant datasets of a query dataset. In our proposed system, we introduced the algorithm for generating the concept matrix and dataset matrix to curate the ranked list of the datasets by comparing the concept vectors of all the registered datasets in response to the user’s search query. Moreover, we have proposed matrix compression, indexing, and caching schemes to reduce the required storage and time while searching the related ranked list of the datasets. In addition to that we employed the domain knowledge of ontology while searching the dataset for semantic search. To verify the efficiency of our proposed system, we analyzed the precision values, recall, accuracy, F1-score, and time. From the experimental results, it can be seen that our proposed approach is better in terms of search efficiency and storage by utilizing caching, indexing, and matrix compression techniques. One limitation of the proposed method is its reliance on structural similarity alone to assess relationships between concepts. This approach may overlook nuanced, context-dependent meanings, which could reduce accuracy in cases where semantic relationships cannot be adequately represented by structural features alone. To address this, future work will incorporate additional semantic factors that capture complex, context-sensitive relationships, potentially enhancing accuracy in cases where structural similarity alone is insufficient. We also aim to improve our system by employing other semantic relations, such as semantically related link structures, and to refine the interpretation of user search intent to capture semantic subtleties within user queries. Finally, to ensure practical usability, we will expand our evaluation to include usability testing and assessments of user satisfaction with the quality and relevance of retrieved datasets. This will involve gathering feedback from domain experts and end-users through surveys and interviews to understand how effectively the system meets user expectations. By continuously refining the system based on user interaction and feedback, we aim to ensure its applicability in diverse contexts and real-world scenarios.

Author Contributions

Conceptualization, T.S., U.Q., M.U., and M.D.H.; Project administration, M.D.H.; Software, T.S.; Supervision, M.D.H.; Writing—original draft, T.S., U.Q., M.U., and M.D.H.; Writing—review and editing, T.S., U.Q., M.U., and M.D.H. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Domain category graph.

Figure 2. Methodology process diagram of the proposed system.

Figure 3. The Concept Matrix for the Transport ontology.

Figure 4. Flow chart for generating the concept matrix.

Figure 5. System to calculate the similarity.

Figure 6. The TF-IDF graph between the dataset and concept having the semantic threshold = 0.8.

Figure 7. The Dataset Matrix for the Transport ontology.

Figure 8. Flow chart for generating the Dataset Matrix.

Figure 9. Flow chart for compression.

Figure 10. The proposed cache architecture.

Figure 11. Performance analysis of our proposed scheme in terms of existing schemes; (a) Precision, (b) Recall, (c) F1-score, and (d) Accuracy.

Figure 12. The overall result of the five query datasets on the proposed algorithm.

Figure 13. The compression during runtime.

Table 1

The set of query datasets used for the experiment.

Query_Id	Details
1.	URL	:transport-road-transport-in-europe
	Title	Transport Road Transport in Europe
	Description	Road transport statistics for European countries. This dataset was prepared by Google based on data downloaded from Eurostat.
2.	URL	:transport-exports-by-mode-of-transport-1966
	Title	Transport Exports by Mode of Transport, 1966
	Description	License Rights under which the catalog can be reused are outlined in the Open Government License - Canada Available download formats from providers jpg, pdf Description Contained within the 4th Edition (1974) of the Atlas of Canada is a graph and two maps.
3.	URL	:transport-bus-breakdown-and-delays
	Title	Transport Bus Breakdown and Delays
	Description	The Bus Breakdown and Delay system collects information from school bus vendors operating out in the field in real time.
4.	URL	:transport-motor-vehicle-output-truck-output
	Title	Transport Motor vehicle output: Truck output
	Description	Graph and download economic data for motor vehicle output: Truck output (A716RC1A027NBEA) from 1967 to 2018 about output, trucks, vehicles, GDP, and USA.
5.	URL	:trans-national-public-transport-data-repository-nptdr
	Title	National Public Transport Data Repository (NPTDR)
	Description	The NPTDR database contains a snapshot of every public transport journey in Great Britain for a selected week in October each year.

Table 2

The average time breakout of our proposed scheme for the five queries.

Task	Average Time (s)	Standard Deviation
Index search	0.70	0.20
Decompression of rows	0.25	0.13
DM row calculation	0.50	0.30
Total time without cache(s)	1.40	0.63
Total time with cache hit(s)	∼0.02

Table 3

Technical terms and their abbreviations; N/A = Not Applicable.

Technical Term	Abbreviation	Description
Ontology	N/A	A formal representation of a set of concepts within a domain and the relationships between them.
Least Common Subsumer	LCS	The lowest node in a taxonomy that is a hypernym of two concepts.
Wu and Palmer Similarity	N/A	A method to compute the relatedness of two concepts by considering the depth of the synsets and their least common subsumer.
Concept Matrix	CM	A matrix where each element represents the structural similarity between two concepts in an ontology.
Taxonomy	N/A	A hierarchical structure of categories or concepts.
Semantic Web	N/A	A framework that allows data to be shared and reused across application, enterprise, and community boundaries.
Resource Description Framework	RDF	A framework for representing information about resources on the web.
Simple Knowledge Organization System	SKOS	A common data model for sharing and linking knowledge organization systems via the web.
Hypernym	N/A	A word with a broad meaning that more specific words fall under; for example, “vehicle“ is a hypernym of “car”.
Structural Similarity	N/A	A measure of how similar two concepts are based on the structure of the ontology.
Information Content	N/A	The amount of information a concept contains, often used to calculate similarities in ontologies.
Synset	N/A	A set of one or more synonyms that are interchangeable in some context.
Path Length	N/A	The shortest distance between two concepts in an ontology, often measured in the number of edges.
Hierarchical Structure	N/A	A system of elements ranked one above another, typically seen in ontologies.
Similarity Score	N/A	A numerical value representing the similarity between two concepts.
Graph-based Representation	N/A	A way to model data where entities are nodes, and relationships are edges in a graph.
Ontology Matching	N/A	The process of finding correspondences between semantically related entities in different ontologies.
Least Number of Edges	N/A	The minimum number of edges between two nodes (concepts) in a graph or ontology.
Knowledge Base	KB	A database that stores facts and rules about a domain, used for reasoning and inference.
Natural Language Processing	NLP	A field of AI that focuses on the interaction between computers and humans using natural language.

References

1. Hendler, J.; Holm, J.; Musialek, C.; Thomas, G. US government linked open data: Semantic. data. gov. IEEE Intell. Syst.; 2012; 27, pp. 25-31. [DOI: https://dx.doi.org/10.1109/MIS.2012.27]

2. Kassen, M. A promising phenomenon of open data: A case study of the Chicago open data project. Gov. Inf. Q.; 2013; 30, pp. 508-513. [DOI: https://dx.doi.org/10.1016/j.giq.2013.05.012]

3. Burwell, S.M.; VanRoekel, S.; Park, T.; Mancini, D.J. Open data policy—Managing information as an asset. Exec. Off. Pres.; 2013; 13, 13.

4. Brickley, D.; Burgess, M.; Noy, N. Google Dataset Search: Building a search engine for datasets in an open Web ecosystem. Proceedings of the World Wide Web Conference; San Francisco, CA, USA, 13–17 May 2019; pp. 1365-1375.

5. Bizer, C.; Volz, J.; Kobilarov, G.; Gaedke, M. Silk-a link discovery framework for the web of data. Proceedings of the 18th International World Wide Web Conference. Citeseer; Madrid, Spain, 20–24 April 2009; Volume 122.

6. Suchanek, F.M.; Abiteboul, S.; Senellart, P. Paris: Probabilistic alignment of relations, instances, and schema. arXiv; 2011; arXiv: 1111.7164[DOI: https://dx.doi.org/10.14778/2078331.2078332]

7. Azoff, E.M. Neural Network Time Series Forecasting of Financial Markets; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1994.

8. Chapman, A.; Simperl, E.; Koesten, L.; Konstantinidis, G.; Ibáñez, L.D.; Kacprzak, E.; Groth, P. Dataset search: A survey. VLDB J.; 2020; 29, pp. 251-272. [DOI: https://dx.doi.org/10.1007/s00778-019-00564-x]

9. Maier, D.; Megler, V.; Tufte, K. Challenges for dataset search. Proceedings of the International Conference on Database Systems for Advanced Applications; Bali, Indonesia, 21–24 April 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 1-15.

10. Castelo, S.; Rampin, R.; Santos, A.; Bessa, A.; Chirigati, F.; Freire, J. Auctus: A dataset search engine for data discovery and augmentation. Proc. VLDB Endow.; 2021; 14, pp. 2791-2794. [DOI: https://dx.doi.org/10.14778/3476311.3476346]

11. Sultana, T.; Lee, Y.K. gRDF: An Efficient Compressor with Reduced Structural Regularities That Utilizes gRePair. Sensors; 2022; 22, 2545. [DOI: https://dx.doi.org/10.3390/s22072545]

12. Sultana, T.; Lee, Y.K. Efficient rule mining and compression for RDF style KB based on Horn rules. J. Supercomput.; 2022; 78, pp. 16553-16580. [DOI: https://dx.doi.org/10.1007/s11227-022-04519-y]

13. Sultana, T.; Lee, Y.K. Expressive rule pattern based compression with ranking in Horn rules on RDF style kb. Proceedings of the 2021 IEEE International Conference on Big Data and Smart Computing (BigComp); Jeju Island, Republic of Kore, 17–20 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13-19.

14. Slimani, T. Description and evaluation of semantic similarity measures approaches. arXiv; 2013; arXiv: 1310.8059[DOI: https://dx.doi.org/10.5120/13897-1851]

15. Hagelien, T.F. A Framework for Ontology Based Semantic Search. Master’s Thesis; NTNU: Trondheim, Norway, 2018.

16. Jiang, S.; Hagelien, T.F.; Natvig, M.; Li, J. Ontology-based semantic search for open government data. Proceedings of the 2019 IEEE 13th International Conference on Semantic Computing (ICSC); Newport Beach, CA, USA, 30 January–1 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7-15.

17. Rasel, M.K.; Elena, E.; Lee, Y.K. Summarized bit batch-based triangle listing in massive graphs. Inf. Sci.; 2018; 441, pp. 1-17. [DOI: https://dx.doi.org/10.1016/j.ins.2018.02.018]

18. Wu, D.; Wang, H.T.; Tansel, A.U. A survey for managing temporal data in RDF. Inf. Syst.; 2024; 122, 102368. [DOI: https://dx.doi.org/10.1016/j.is.2024.102368]

19. Arenas-Guerrero, J.; Iglesias-Molina, A.; Chaves-Fraga, D.; Garijo, D.; Corcho, O.; Dimou, A. Declarative generation of RDF-star graphs from heterogeneous data. Semant. Web; 2024; pp. 1-19. [DOI: https://dx.doi.org/10.3233/SW-243602]

20. Sultana, T.; Hossain, M.D.; Morshed, M.G.; Afridi, T.H.; Lee, Y.K. Inductive autoencoder for efficiently compressing RDF graphs. Inf. Sci.; 2024; 662, 120210. [DOI: https://dx.doi.org/10.1016/j.ins.2024.120210]

21. Ngomo, A.C.N.; Auer, S. LIMES—A time-efficient approach for large-scale link discovery on the web of data. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence; Barcelona, Spain, 16–22 July 2011.

22. Araujo, S.; Tran, D.T.; de Vries, A.P.; Schwabe, D. SERIMI: Class-based matching for instance matching across heterogeneous datasets. IEEE Trans. Knowl. Data Eng.; 2014; 27, pp. 1397-1440. [DOI: https://dx.doi.org/10.1109/TKDE.2014.2365779]

23. Araújo, T.B.; Stefanidis, K.; Santos Pires, C.E.; Nummenmaa, J.; Da Nóbrega, T.P. Schema-agnostic blocking for streaming data. Proceedings of the 35th Annual ACM Symposium on Applied Computing; Brno, Czech Republic, 30 March–3 April 2020; pp. 412-419.

24. Nikolov, A.; Uren, V.; Motta, E.; Roeck, A.d. Integration of semantically annotated data by the KnoFuss architecture. Proceedings of the International Conference on Knowledge Engineering and Knowledge Management; Acitrezza, Italy, 29 September–2 October 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 265-274.

25. Efthymiou, V.; Papadakis, G.; Stefanidis, K.; Christophides, V. MinoanER: Schema-agnostic, non-iterative, massively parallel resolution of web entities. arXiv; 2019; arXiv: 1905.06170

26. Papadakis, G.; Tsekouras, L.; Thanos, E.; Pittaras, N.; Simonini, G.; Skoutas, D.; Isaris, P.; Giannakopoulos, G.; Palpanas, T.; Koubarakis, M. JedAI3: Beyond batch, blocking-based Entity Resolution. Proceedings of the EDBT; Copenhagen, Denmark, 30 March–2 April 2020; pp. 603-606.

27. Pelgrin, O.; Galárraga, L.; Hose, K. Towards fully-fledged archiving for RDF datasets. Semant. Web; 2021; pp. 1-24. [DOI: https://dx.doi.org/10.3233/SW-210434]

28. De Meester, B.; Heyvaert, P.; Arndt, D.; Dimou, A.; Verborgh, R. RDF graph validation using rule-based reasoning. Semant. Web; 2021; 12, pp. 117-142. [DOI: https://dx.doi.org/10.3233/SW-200384]

29. Kettouch, M.S.; Luca, C. LinkD: Element-based data interlinking of RDF datasets in linked data. Computing; 2022; 104, pp. 2685-2709. [DOI: https://dx.doi.org/10.1007/s00607-022-01107-z]

30. Deepak, G.; Santhanavijayan, A. OntoBestFit: A best-fit occurrence estimation strategy for RDF driven faceted semantic search. Comput. Commun.; 2020; 160, pp. 284-298. [DOI: https://dx.doi.org/10.1016/j.comcom.2020.06.013]

31. Niazmand, E.; Sejdiu, G.; Graux, D.; Vidal, M.E. Efficient semantic summary graphs for querying large knowledge graphs. Int. J. Inf. Manag. Data Insights; 2022; 2, 100082. [DOI: https://dx.doi.org/10.1016/j.jjimei.2022.100082]

32. Ferrada, S.; Bustos, B.; Hogan, A. Extending SPARQL with Similarity Joins. Proceedings of the International Semantic Web Conference; Athens, Greece, 2–6 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 201-217.

33. Sultana, T.; Hossain, M.D.; Umair, M.; Khan, M.N.; Alam, A.; Lee, Y.K. Graph pattern detection and structural redundancy reduction to compress named graphs. Inf. Sci.; 2023; 647, 119428. [DOI: https://dx.doi.org/10.1016/j.ins.2023.119428]

34. Umair, M.; Sultana, T.; Lee, Y.-K. Pre-Trained Language Models for Keyphrase Prediction: A Review. ICT Express; 2024; 10, pp. 871-890. [DOI: https://dx.doi.org/10.1016/j.icte.2024.05.015]

35. Lorenzi, F.; Bazzan, A.L.; Abel, M.; Ricci, F. Improving recommendations through an assumption-based multiagent approach: An application in the tourism domain. Expert Syst. Appl.; 2011; 38, pp. 14703-14714. [DOI: https://dx.doi.org/10.1016/j.eswa.2011.05.010]

36. Salehi, M.; Kmalabadi, I.N. A hybrid attribute–based recommender system for e–learning material recommendation. Ieri Procedia; 2012; 2, pp. 565-570. [DOI: https://dx.doi.org/10.1016/j.ieri.2012.06.135]

37. Kardan, A.A.; Ebrahimi, M. A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups. Inf. Sci.; 2013; 219, pp. 93-110. [DOI: https://dx.doi.org/10.1016/j.ins.2012.07.011]

38. Miles, A.; Bechhofer, S. SKOS simple knowledge organization system reference. W3C Recommendation; 2009; Available online: https://www.w3.org/TR/skos-reference/ (accessed on 4 November 2024).

39. Göğebakan, K.; Ulu, R.; Abiyev, R.; Şah, M. A drug prescription recommendation system based on novel DIAKID ontology and extensive semantic rules. Health Inf. Sci. Syst.; 2024; 12, 27. [DOI: https://dx.doi.org/10.1007/s13755-024-00286-7]

40. Oliveira, F.d.; Oliveira, J.M.P.d. A RDF-based graph to representing and searching parts of legal documents. Artif. Intell. Law; 2024; 32, pp. 667-695. [DOI: https://dx.doi.org/10.1007/s10506-023-09364-9]

41. Kim, J.; Lee, S.W. The Ontology Based, the Movie Contents Recommendation Scheme, Using Relations of Movie Metadata. J. Intell. Inf. Syst.; 2013; 19, pp. 25-44.

42. Lee, W.P.; Chen, C.T.; Huang, J.Y.; Liang, J.Y. A smartphone-based activity-aware system for music streaming recommendation. Knowl.-Based Syst.; 2017; 131, pp. 70-82. [DOI: https://dx.doi.org/10.1016/j.knosys.2017.06.002]

43. Dong, H.; Hussain, F.K.; Chang, E. A service concept recommendation system for enhancing the dependability of semantic service matchmakers in the service ecosystem environment. J. Netw. Comput. Appl.; 2011; 34, pp. 619-631. [DOI: https://dx.doi.org/10.1016/j.jnca.2010.11.010]

44. Mohanraj, V.; Chandrasekaran, M.; Senthilkumar, J.; Arumugam, S.; Suresh, Y. Ontology driven bee’s foraging approach based self adaptive online recommendation system. J. Syst. Softw.; 2012; 85, pp. 2439-2450. [DOI: https://dx.doi.org/10.1016/j.jss.2011.12.018]

45. Chen, R.C.; Huang, Y.H.; Bau, C.T.; Chen, S.M. A recommendation system based on domain ontology and SWRL for anti-diabetic drugs selection. Expert Syst. Appl.; 2012; 39, pp. 3995-4006. [DOI: https://dx.doi.org/10.1016/j.eswa.2011.09.061]

46. Torshizi, A.D.; Zarandi, M.H.F.; Torshizi, G.D.; Eghbali, K. A hybrid fuzzy-ontology based intelligent system to determine level of severity and treatment recommendation for Benign Prostatic Hyperplasia. Comput. Methods Programs Biomed.; 2014; 113, pp. 301-313. [DOI: https://dx.doi.org/10.1016/j.cmpb.2013.09.021] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24184111]

47. Wang, X.; Cheng, G.; Lin, T.; Xu, J.; Pan, J.Z.; Kharlamov, E.; Qu, Y. PCSG: Pattern-coverage snippet generation for RDF datasets. Proceedings of the International Semantic Web Conference; Virtual, 24–28 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 3-20.

48. Obe, R.O.; Hsu, L.S. PostgreSQL: Up and Running: A Practical Guide to the Advanced Open Source Database; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017.

49. Velasco, R. Apache Solr: For Starters, 2016. Available online: https://dl.acm.org/doi/10.5555/3126424 (accessed on 4 November 2024).

50. Robertson, S.; Zaragoza, H. The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retr.; 2009; 3, pp. 333-389. [DOI: https://dx.doi.org/10.1561/1500000019]

51. Liu, T.Y. Learning to rank for information retrieval. Found. Trends® Inf. Retr.; 2009; 3, pp. 225-331. [DOI: https://dx.doi.org/10.1561/1500000016]

52. Wu, Z.; Palmer, M. Verb semantics and lexical selection. arXiv; 1994; arXiv: cmp-lg/9406033

53. Wu, H.C.; Luk, R.W.P.; Wong, K.F.; Kwok, K.L. Interpreting tf-idf term weights as making relevance decisions. ACM Trans. Inf. Syst. (TOIS); 2008; 26, pp. 1-37. [DOI: https://dx.doi.org/10.1145/1361684.1361686]

54. Muja, M.; Lowe, D.G. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP; 2009; 2, 2.

55. Silpa-Anan, C.; Hartley, R. Optimised KD-trees for fast image descriptor matching. Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition; Anchorage, Alaska, 23–28 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1-8.

56. Shah, K.; Mitra, A.; Matani, D. An O (1) algorithm for implementing the LFU cache eviction scheme. No; 2010; 1, pp. 1-8.

57. Eklov, D.; Hagersten, E. StatStack: Efficient modeling of LRU caches. Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS); White Plains, NY, USA, 28–30 March 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 55-65.

58. Luo, L.; Guo, D.; Ma, R.T.; Rottenstreich, O.; Luo, X. Optimizing bloom filter: Challenges, solutions, and comparisons. IEEE Commun. Surv. Tutorials; 2018; 21, pp. 1912-1949. [DOI: https://dx.doi.org/10.1109/COMST.2018.2889329]

59. Leacock, C.; Chodorow, M. Combining local context and WordNet similarity for word sense identification. WordNet Electron. Lex. Database; 1998; 49, pp. 265-283.

60. Scheuerman, M.K.; Hanna, A.; Denton, E. Do datasets have politics? Disciplinary values in computer vision dataset development. Proc. ACM Hum.-Comput. Interact.; 2021; 5, pp. 1-37. [DOI: https://dx.doi.org/10.1145/3476058]

61. Buraga, S.C.; Amariei, D.; Dospinescu, O. An owl-based specification of database management systems. Comput. Mater. Contin; 2022; 70, pp. 5537-5550. [DOI: https://dx.doi.org/10.32604/cmc.2022.021714]

62. Cappuzzo, R.; Papotti, P.; Thirumuruganathan, S. Creating embeddings of heterogeneous relational datasets for data integration tasks. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data; Portland, OR, USA, 14–19 June 2020; pp. 1335-1349.

Word count: 12980

Show less

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Governments are embracing an open data philosophy and making their data freely availableto the public to encourage innovation and increase transparency. However, the number of availabledatasets is still limited. Finding relationships between related datasets on different data portalsenables users to search the relevant datasets. These datasets are generated from the training data,which need to be curated by the user query. However, relevant dataset retrieval is an expensiveoperation due to the preparation procedure for each dataset. Moreover, it requires a significantamount of space and time. In this study, we propose a novel framework to identify the relationshipsbetween datasets using structural information and semantic information for finding similar datasets.We propose an algorithm to generate the Concept Matrix (CM) and the Dataset Matrix (DM) fromthe concepts and the datasets, which is then used to curate semantically related datasets in responseto the users’ submitted queries. Moreover, we employ the proposed compression, indexing, andcaching algorithms in our proposed scheme to reduce the required storage and time while searchingthe related ranked list of the datasets. Through extensive evaluation, we conclude that the proposedscheme outperforms the existing schemes.

Details

Title

An Efficient Framework for Finding Similar Datasets Based on Ontology

Author

Sultana, Tangina¹

; Qudus, Umair²

; Umair, Muhammad³

; Hossain, Md Delowar⁴

¹ Department of Computer Science and Engineering, Kyung Hee University, Yongin-si 17104, Republic of Korea; [email protected] (T.S.); [email protected] (M.U.); Department of Electronics and Communication Engineering, Hajee Mohammad Danesh Science & Technology University, Dinajpur 5200, Bangladesh
² Department of Computer Science, Paderborn University, Warburger Str. 100, 33098 Paderborn, Germany; [email protected]
³ Department of Computer Science and Engineering, Kyung Hee University, Yongin-si 17104, Republic of Korea; [email protected] (T.S.); [email protected] (M.U.)
⁴ Department of Computer Science and Engineering, Kyung Hee University, Yongin-si 17104, Republic of Korea; [email protected] (T.S.); [email protected] (M.U.); Department of Computer Science and Engineering, Hajee Mohammad Danesh Science & Technology University, Dinajpur 5200, Bangladesh

First page

4417

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

20799292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/electronics13224417

ProQuest document ID

3133009411