Abstract
To date, most large corporations still run their core solutions on relational databases and only use non-relational (i.e., NoSQL) database management systems (DBMS) for non-core systems that favour availability and scalability through partitioning while trading off consistency. NoSQL systems are built on the CAP (Consistency, Availability and Partition tolerance) database theorem, which trades off one of these features while maintaining the others. The need for system availability and scalability drives the use of NoSQL, while the lack of consistency and of robust query engines, as obtainable in relational databases, impedes their usage. To mitigate these drawbacks, researchers and companies like Amazon, Google, and Facebook run 'SQL over NoSQL' systems such as Dynamo, Google's Spanner, Memcache, Zidian, Apache Hive and SparkSQL. These systems create a query engine layer over NoSQL systems but suffer from data redundancy and lack the consistency obtainable in relational DBMS. Also, their query engines are not relationally complete, because they cannot process all relational algebra-based queries as obtainable in a relational database. In this paper, we present a 'Unique NoSQL over SQL Database' (UniqueNOSD) system, an extension of NOSD and an inverse of existing approaches. This approach is motivated by the need for existing systems to fully deploy NoSQL data store functionalities without the limitation of building an extra SQL layer for querying. To allow appropriate storage and retrieval of data on document-based NoSQL databases without data redundancy and inconsistency, while encouraging both horizontal and vertical partitioning, we propose the NoSQL over SQL Block as a Value (BaaV) strategy.
Introduction
In this section, we discuss some preliminaries including the motivations behind the invention of NoSQL databases (Section "Background") and a short description of the different types of NoSQL databases that exist (Section "NoSQL Databases and their Forms") to lay the foundations for our work.
Background
Relational databases, which allow data to be modelled and saved in rows and columns, have gained popularity and wide adoption in research and industry communities due to their strict structure and robust query engines, grounded in relational algebra. The amount of data generated by applications today, such as Amazon, YouTube, Facebook, and Twitter, is not only large in size but also comes in different forms and changes at a very high frequency. This results in the need for a new set of databases that can scale as the data demands while handling fast-changing and unstructured datasets [1]. The features of such system-generated data, such as large size (Volume), the speed at which the data must be processed or becomes accessible (Velocity), varying structures and formats (Variety), and continuously changing meaning for the same data (Variability), give rise to newer data processing and analysis problems. Additionally, data generated by systems today are required to have a sufficient level of accuracy (referred to as Veracity), and organizations must be able to derive Value from such data if it is to be retained in the system or if the system's resources are to be expended on it. Due to the volume, velocity, veracity, value and other features mentioned above, there is also a need for these data to be Visualizable. This necessitates a new set of tools able to handle the features of data stated above (i.e., Volume, Velocity, Variety, Variability, Veracity, Value and Visualizable) [2]. These features, referred to as the seven (7) Vs of Big Data, make relational databases ineffective for storing such data [3]. The creation of distributed database systems is guided by three features, namely: a. Consistency, b. Availability and c. Partition Tolerance. These three features bring about a database design trade-off known as the CAP Theorem. The CAP theorem implies that it is impossible to achieve all three properties of consistency, availability, and partition tolerance at the same time in a distributed system with data replication [4].
NoSQL, a term which means Not SQL or Not Only SQL [5], describes a variation of big data systems that store data in a non-relational format. Currently, there are several types of NoSQL database systems, including Graph databases (such as Sparksee/DEX [6], TigerGraph [7], GraphBase [8], Memgraph [9, 10], Neo4j [11], PGX [12]), Document-based NoSQL databases (such as OrientDB [13], ArangoDB [14], Azure Cosmos DB [15], FaunaDB [16] and ElasticSearch [17]), and Key-Value stores (such as Amazon's Dynamo [18], Facebook's Memcache [19], and Google's Bigtable [20]). All these variations of NoSQL systems are built based on the CAP theorem, enforcing Availability and Partition tolerance while trading off Consistency. Presented below are the design concepts of the different types of NoSQL systems. While our system achieves the three features not achieved by previous systems, we explain existing systems and their respective contributions and limitations in the related works section.
NoSQL databases and their forms
Graph NoSQL stores
Numerous computing problems, including those in health, telecommunications, transportation and other areas, are modelled as graphs. This prompts the need for specialized databases for storing graphs. In a graph database, data is represented as a graph comprising vertices (i.e., graph nodes) and edges. Related nodes can be discovered by traversing the nodes using graph path expressions. A path expression is often used to describe a query, which then extracts the data from the graph that matches the query. In most cases, a path is defined by a start node, one or more relationships, and one or more end nodes that match the pattern. The foundation for graph databases is the graph model G(V, E).
A graph G can be modelled as a tuple (V, E), where V is a set of vertices and E is a set of edges. Also, |V| and |E| denote the cardinality of the vertices and edges of the graph respectively. In practice, there are two types of graphs based on the types of data they handle (i.e., directed and undirected graphs), which is an important factor in graph database model designs. A practical example of a directed graph is the usage of maps in the transportation sector. In this context, a graph representing an object/vehicular movement from point A to B, or in computer networks from a source IP/MAC address to a destination IP/MAC address, is said to be directed, while a graph representation of Facebook users is said to be undirected, since if Person A is a friend of Person B, Person B is also a friend of Person A. This is unlike the Twitter social network, in which user A following user B does not imply that user B follows A (a directed graph). For a directed graph G, an edge e = (u, v) is a tuple of two vertices, where u is the out-vertex (also denoted as the source vertex, e.g., the source IP/MAC address or Person A in the example above), and v is the in-vertex (also regarded as the target/destination vertex, e.g., the destination IP/MAC address or Person B in the example above). For an undirected graph G, an edge e = {u, v} is a set of two vertices without the need to store direction data. Likewise, a graph may be weighted to show the strength of the relationship between the two vertices. Hence, a weighted graph G is modelled with a triple as G(V, E, w), where w maps edges to weights.
In graph databases, the two fundamental structures used to depict the vertex-to-vertex connectivity structure (i.e., vertex neighbourhood representations) are the Adjacency Matrix Format and the Adjacency List Format. A detailed example of the existing systems using these approaches and their limitations is provided in Section "Related works". The major advantage of graph databases is their ability to store connected data with their associated properties and computations, while their limitations range from the inability to fully support the database ACID (i.e., Atomicity, Consistency, Isolation and Durability) properties to complicated partitioning procedures due to heavily interconnected data [21], as well as the inability to store data of different forms.
Document-oriented NoSQL stores
Document-based NoSQL database stores are among the most popular databases. They are designed to handle data without following the conventional normalization theories and are hence able to handle semi-structured data. The ability of document-based NoSQL databases to handle semi-structured data without requiring a specific schema associated with the data results in the complexity of their designs [22]. Generally, document-based NoSQL databases store data in files in textual formats, either as JSON or XML. The data are associated with their respective internally represented file IDs and retrieved using the file ID or a textual (e.g., Lucene) keyword search. Examples of existing document database stores are BaseX [23], OrientDB [13], ArangoDB [14], Azure Cosmos DB [15], FaunaDB [16] and ElasticSearch [17]. Document-oriented databases are designed to manage large amounts of data with high database partitioning ability, scalability and availability. The major drawbacks of these types of databases include the inability to achieve the ACID properties known from traditional relational databases, particularly consistency, due to the high degree of partitioning in their design. Another drawback is their inability to handle complex queries. Existing document-oriented databases are primarily queried through APIs (Application Programming Interfaces), which requires high technical expertise. Works that attempt to solve the complex-query drawback of document-oriented database stores are based on building an SQL layer on their front end. These systems, such as MongoDB Atlas [24], Memcache [19], and Google's Spanner [25], are able to provide a query interface but still suffer from both the inability to achieve ACID properties and the inability to handle complex queries, because they are not based on relational algebra and calculus. Thus, they are not able to handle complex relational queries as obtainable in existing relational databases. Additionally, MongoDB Atlas can only perform SQL read operations, which must be executed on MongoDB's federated database instances [24].
Key-value stores
This is a simple data model NoSQL database, in which a key is linked to each data value in a dictionary-like method. It is made to store, retrieve, and manage data as a set of distinctive identifiers with related values. Its primary data structure is referred to as a key-value pair. In this design, there is only one value per key in the collection. The key serves as the distinctive identifier for each entry, and the accompanying value is either the information that the key identifies or its location.
Key-value databases such as Amazon's Dynamo [18] and Google's Bigtable [20] are designed to sacrifice consistency for availability and partitioning. The typical queries supported by most key-value NoSQL databases are PUT, GET and DELETE, as sketched below. Hence, they are deployed in systems that do not require complex query execution and data consistency but do require high availability and partitioning.
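To make the PUT/GET/DELETE interface concrete, the following minimal Python sketch models a key-value store as an in-memory dictionary. The class and method names are illustrative only and do not correspond to any particular product's API; real stores such as Dynamo or Bigtable add replication, partitioning and persistence behind a similar interface.

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing PUT, GET and DELETE (illustrative)."""

    def __init__(self):
        self._data = {}  # one value per key; the value may be any object

    def put(self, key, value):
        self._data[key] = value              # insert or overwrite

    def get(self, key, default=None):
        return self._data.get(key, default)  # retrieve by unique key

    def delete(self, key):
        self._data.pop(key, None)            # remove if present


store = KeyValueStore()
store.put("session:42", {"cart": ["book", "pen"], "user": "user1"})
print(store.get("session:42"))   # {'cart': ['book', 'pen'], 'user': 'user1'}
store.delete("session:42")
print(store.get("session:42"))   # None
```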
Object databases
This type of NoSQL database allows database designers to specify complex data structures with the help of object-oriented paradigms. A significant characteristic of these systems is that object databases enable designers to specify both the structure of complex objects and the operations that can be applied to them. The need to couple object-oriented programming languages with a database gave rise to object-oriented databases. Although object-oriented databases have existed since the late 1970s, relational databases and functional programming languages have become more commonplace in recent years, leading to relatively low adoption of object-oriented databases. An object-oriented database (OOD) is a type of database that can handle sophisticated data objects similar to those found in object-oriented programming languages. Every element is a separate object in object-oriented programming (OOP). Many objects are also highly complex, with several different qualities (called properties or attributes in OOP) and functions (called methods). To make the storage and retrieval of object-oriented data easier, object-oriented database management systems integrate with OOP; existing databases such as MongoDB [14] provide a way to store data in OOP format, while many programming languages such as Java and C# have data retrieval libraries supporting OOP paradigms in the form of object-relational mappers [26, 27]. The major limitations of OOD include the lack of a universal data model, the lack of a structured query language, and weaker database partitioning support due to implementation complexity, thereby leading to low adoption.
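As a hedged illustration of how an object-relational (or object-document) mapping layer couples application classes with stored records, the sketch below defines a plain Python class with properties and a method, and converts instances to and from a dictionary record; the class and field names are hypothetical and not tied to any specific OOD product.

```python
from dataclasses import dataclass, asdict

@dataclass
class Vehicle:
    # Properties (attributes) of the object
    vehicle_id: str
    make: str
    mileage_km: float

    # A method (behaviour) attached to the object
    def add_trip(self, distance_km: float) -> None:
        self.mileage_km += distance_km

def to_record(obj: Vehicle) -> dict:
    """Map the object to a storable record, as an ORM/ODM layer would."""
    return asdict(obj)

def from_record(record: dict) -> Vehicle:
    """Rebuild the object from a stored record."""
    return Vehicle(**record)

car = Vehicle("V-001", "Toyota", 12000.0)
car.add_trip(350.5)
stored = to_record(car)              # {'vehicle_id': 'V-001', 'make': 'Toyota', ...}
assert from_record(stored) == car    # round trip preserves the object
```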
Hybrid databases
This type of NoSQL database mixes the characteristics of two or more of the document-based, key-value, OOD and graph-based models. Hybrid NoSQL database systems are designed to make compromises in order to optimize the CAP theorem for their applications. Although a hybrid database system may employ several types of database software and utilize their characteristics for certain applications and workloads, it is still difficult for the database software design to incorporate all three features of CAP in order to obtain the best performance. Existing hybrid NoSQL systems such as Azure Cosmos DB [15] (combining relational and document-oriented models) and ArangoDB [14] (combining document, graph and key-value data models [28]) still have to trade off the consistency feature of CAP and are not able to store data uniquely, leading to data redundancy issues [4]. Most hybrid database management systems are primarily accessed using APIs [28], while a few, such as Azure Cosmos DB [15] and OrientDB [13], provide SQL query engines for querying their relational data [29].
Contributions
In our previous work on NOSD [30], the proposed BaaV model improves data retrieval time with data storage in a relational-like format on document-based NoSQL databases. In this work, we propose UniqueNOSD, an extension of NOSD [30], by (i) enforcing data integrity and consistency through unique storage of data on the document-oriented NoSQL database to reduce redundancy, as demonstrated in Section "Operations on UniqueNOSD" of this paper, (ii) extending the data conversion process from non-textual to textual format by using a faster and less storage-consuming symmetric data encoding technique, FasterBase64 [31], instead of the ordinary base64 algorithm [32] used in NOSD [30], (iii) extending and demonstrating, using relational algebra-based queries, that UniqueNOSD can handle relational-like query operations that were not handled by the earlier NOSD [30] (Section "Join operator"), and (iv) conducting extensive and more detailed experiments comparing UniqueNOSD with more state-of-the-art 'SQL over NoSQL' systems such as SparkSQL [33] and Apache Hive [34], in addition to Zidian2019 [35], NOSD [30] and the relational database management systems MSSQL and MySQL. Furthermore, our experimentation on UniqueNOSD includes more and different datasets that were not used for evaluating NOSD [30], such as the Transaction Processing Council's TPC-H [36] auto-generated datasets and queries [37], Canada Weather datasets, US Bureau of Transportation Statistics datasets and the United Kingdom's Ministry of Transportation datasets, as explained in Sections "Experimental Evaluation" and "Results and Analysis" of this paper. These contributions are further divided into feature and procedural contributions as follows:
Feature contributions
The main feature contributions of this study are to:
Enable the automatic replication of existing relational data into a document-oriented database without data loss, and enforce data consistency and availability with both horizontal and vertical partitioning associated with NoSQL databases (Section "Data Replication from Relational Database to Document Oriented Database"), using our extended NOSDReplica algorithm, which ensures that already replicated data are not re-replicated.
Enable data storage on NoSQL database, using our extended NoSQL Block as a Value (BaaV) strategy to keep the integrity of replicated relational data (Section "Operations on UniqueNOSD").
Integrating videos, pictures and audio storage with textual data on NoSQL databases, using FasterBase64 format [31] without data loss and ensuring consistency without data redundancy.
Procedural contributions
To overcome the aforementioned limitations of relational databases and existing document-oriented NoSQL databases, and to meet the research goal of extending our previous work on NOSD [30], the procedural contributions corresponding to the feature contributions are as follows:
Propose a model called UniqueNOSD that transforms a relational database into a document-oriented database using the Block as a Value (BaaV) strategy without data loss while improving retrieval time. Taking advantage of the scalability of document-oriented NoSQL databases, UniqueNOSD first denormalizes the relational database while retaining the relations' primary and foreign keys, as explained in Section "Proposed unique NOSQL over SQL block as a value system(UniqueNOSD)". Then, we use the BaaV strategy explained in Section "Operations on UniqueNOSD". We group relations related through foreign keys together (called a Block of value in this study) while selecting a key for the Block through the attributes' functional dependencies, explained in Section "Operations on UniqueNOSD" of this study, to enforce data consistency.
Next, we propose and implement an extended data replication algorithm called NOSDReplica that autonomously replicates data into the document-oriented database, explained in Section "Document-Oriented Databases", where each block is represented as a document with the corresponding id as the key to the block. This is similar to the Aggregate Bottom-Up (ABU) algorithm [38], originally designed to dynamically replicate data in a distributed multi-tier Data Grid.
We use the faster and less storage-consuming symmetric encoding technique FasterBase64 [31] for converting video and audio data objects into a textual format for onward storage in the document-oriented database, thereby enhancing the ability to store audio and video data objects along with other data forms with minimal disk storage usage. This is unlike traditional relational databases, which save images and audio as BLOBs (Binary Large Objects), a collection of binary data represented as a single entity.
We conduct extensive and more detailed experiments comparing UniqueNOSD with more state-of-the-art 'SQL over NoSQL' systems such as SparkSQL [33] and Apache Hive [34], in addition to Zidian2019 [35], NOSD [30] and the relational database management systems MSSQL and MySQL. Furthermore, our experimentation on UniqueNOSD includes more and different datasets, such as the Transaction Processing Council's TPC-H [36] auto-generated datasets and queries, Canada Weather datasets, US Bureau of Transportation Statistics datasets and the United Kingdom's Ministry of Transportation datasets, explained in Sections "Experimental Evaluation" and "Results and Analysis" of this paper.
Problem definition
This paper presents a 'NoSQL over SQL' system called UniqueNOSD, an inverse of existing approaches such as Zidian [35], SparkSQL [33] and Apache Hive [34]. This approach is motivated by the need for existing systems to deploy the full functionalities of NoSQL data stores without the limitation of having to build an extra SQL layer for querying or losing data integrity. This allows appropriate storage and retrieval of data on document-based NoSQL databases without data redundancy and inconsistency while encouraging both horizontal and vertical partitioning. The UniqueNOSD approach is formalized in Section "Operations on UniqueNOSD".
To formally state the problem: given a relational database with relations represented as R(k, A1, ..., An), where the key attribute k is the primary key to the set of attributes {A1, ..., An} of the relation, the task is to generate an equivalent document-oriented NoSQL database by using our proposed Block as a Value (BaaV), denoted as a tuple (K, B) where K means key and B means block. In BaaV, we represent a relation as R(K, B) with a key attribute K and a set of n relations (i.e., r) called blocks B; each r contains a set of its own attributes and is denoted as r(k, a1, ..., an), with a key attribute k and a set of n attributes typical to a relational model. The relations in B are related through foreign key relationships.
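To make the (K, B) structure concrete, the following minimal Python sketch shows one possible document shape for a BaaV entry, grouping tuples from foreign-key-related relations under a single key. The field names and layout are illustrative assumptions rather than the system's actual storage format; the values are taken from the PartSupplier instances used later in the paper.

```python
# One hypothetical BaaV document: the key K identifies a block B whose
# members are tuples drawn from relations related to K via foreign keys.
baav_document = {
    "K": {"SuppKey": "S201"},          # key attribute(s) of the block
    "B": {                             # block: related relations r1..rn
        "Supplier": [{"NationKey": "N5"}],
        "PartSupp": [{"PartKey": "P11", "AvailableQty": 3325, "Cost": 771.64}],
    },
}

# Retrieving the whole block for a supplier is a single key lookup.
print(baav_document["K"], len(baav_document["B"]["PartSupp"]))
```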
The rest of this paper is organized as follows: Section "Related work" summarizes related works about relational databases and NoSQL databases. Section "Proposed unique NOSQL over SQL block as a value system(UniqueNOSD)" introduces the proposed NoSQL over SQL model in detail with the architecture. Section "Experimental Evaluation" presents the experimental setup, and Section "Results and Analysis" presents the results and discussion. Lastly, Section "Conclusion and Future Work" presents the conclusion and future works.
Related work
NoSQL databases have been widely studied, and several NoSQL systems [11, 13, 15, 16, 17, 18, 19–20, 23, 39, 40, 41–42] have been proposed, with many different types adopted in industry and research environments. The design of NoSQL databases is driven by the desire for databases that store data with fewer or no strict structure requirements while offering high availability and partition tolerance. In this paper, we explore related works in five categories. The first category [43, 44, 45, 46, 47–48] entails relational databases storing data as rows and columns in a tabular format, in which relations (i.e., tables) are related through foreign key attributes (section "Relational Databases"). The second group [18, 19–20, 39, 40] is the Key-Value NoSQL databases (section "Key Value Databases"), which store data based on a simple data model of having a key linked to each data value in a dictionary-like method. The third group [11, 41, 49, 50] comprises graph databases that store data as graphs comprising vertices (i.e., graph nodes) and edges (section "Graph-Based Databases"). The fourth category [13, 15, 16–17, 23] of related works discussed is the document-oriented NoSQL databases that store data in files in textual formats, either as JavaScript Object Notation (JSON) or Extensible Markup Language (XML); the data are associated with their respective internally represented file IDs and retrieved using the file ID or textual search with keywords (section "Document-Oriented Databases"). The last category [33, 34–35, 51] of related works explored is the SQL systems built on NoSQL databases to enable SQL over NoSQL (Section "SQL Over NoSQL Systems"). In this section, we discuss these categories of related works.
Relational databases
Relational databases generally store data in tables (called relations) arranged in rows and columns, based on relational algebra and relational calculus theory. Several studies have been proposed on relational databases, particularly on the optimization of relational database query evaluation algorithms. Existing approaches include logic-based and query semantics transformations [43, 44], fast implementations of basic SQL operations such as CUDA [45] and Tenzing [46], and the use of combinatorial or heuristic algorithms for generating alternative query execution plans [47, 48]. Relational databases are still the most popular databases and have gained prominence and wide adoption because of their ability to handle complex queries. Relational databases generally follow relational calculus and relational algebra rules. This model is primarily built on the well-established three-valued predicate calculus of relational algebra (3VPC) of True, False and Unknown (i.e., NULL in SQL queries). The logic-based and semantic approaches primarily use interpretations of SQL expressions (i.e., interpretations of the results in the SELECT, FROM and WHERE clauses of SQL queries).
Although several works have attempted to improve relational database performance, a major limitation of relational databases and their corresponding database management system implementations, such as MySQL [45], MSSQL [46], and Oracle [52], is that as the number of tuples in a relation increases, the retrieval time increases [53, 54–55]. Also, their inability to scale horizontally, to efficiently store data of different forms (such as videos, audio, and text) and to handle fast-changing data in terms of size, speed and context (i.e., the fast-changing meaning of data points) necessitated the use of NoSQL databases, and these remain prevalent research issues.
Key value databases
Key-value databases are a type of non-relational (i.e., NoSQL) database that uses a simple method for data storage. A key-value database is composed of a collection of key-value pairs in which the key acts as a unique identifier for the data. Keys and values can be any type of object, from basic objects to complex compound objects. Existing key-value systems such as Amazon DynamoDB [18], Memcached [19], Google's Bigtable [20], Apache Cassandra [39] and Squirrel [40] are designed to be very fast due to their in-memory implementations. Key-value databases are designed to be simple yet powerful, particularly for storing cached data, owing to their in-memory implementation. Their straightforward architecture supports speedy deployment and simplicity of development, and overcomes numerous issues associated with huge data caches. The common use cases of key-value databases are the storage of sessions for session-oriented applications and of shopping carts for e-commerce systems such as Amazon [18]. Because the value mapped to the key of a key-value database can be of any data type, the schema is implicitly defined per data entry. A typical example is shown in Example 1 below with user application data:
Example 1
Generating a key-value database instance. Given a simple school staff management relational database with the extracted relations User, Course, CourseAssigned and Salary, with the schema:
User(UserId, Name, Position),
Course(CourseId, CourseName),
CourseAssigned(UserId, CourseId),
Salary(UserId, Amount)
Table 1. Instance of relation: User
UserId | Name | Position |
|---|---|---|
user1 | John | Instructor |
user2 | Bull | Teacher |
user3 | Peter | Teaching Assistant |
Table 2. Instance of relation: Course
CourseId | CourseName |
|---|---|
COMP9980 | Computer Architecture |
COM8150 | Advanced Database |
COMP8567 | System Programming |
Table 3. Instance of relation: CourseAssigned
UserId | CourseId |
|---|---|
user1 | Comp8150 |
user2 | Comp8567 |
user1 | comp9980 |
Table 4. Instance of relation: Salary
UserId | Amount |
|---|---|
user1 | 46,000 |
user2 | 50,000 |
user3 | 29,000 |
Provide a sample instance (i.e., sample data entry in the Key-Value database) of the relational database in Key-Value format.
Solution 1: Generating key-value database instance
Recall that key-value stores map a key to uniquely identify a value or collection of values. These values may be of any data type, and there are no strict rules on the schema of the data to be inserted; the schema is implicitly defined (i.e., decided) at the point of the insert operation, as explained in section "Key Value Databases" above. Hence, a sample instance of the relational database given in Example 1 is presented in Fig 1. Another instance could be when the UserId attribute is used to uniquely identify any data object concerning the user, as shown in Fig 2. Hence, there is no specified schema, and the database designer chooses whatever to use as the key at the point of the data INSERT operation (i.e., the PUT operation in NoSQL data stores):
[See PDF for image]
Fig. 1
Key-Value Instance of School Staff Management Database
[See PDF for image]
Fig. 2
Instance of Key-Value of School Staff Management Database
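Since Figs. 1 and 2 are only available in the PDF, the short Python sketch below illustrates, under assumed key choices, how the school staff data of Tables 1–4 could be laid out as key-value entries: first keyed per relation-specific identifier, then keyed solely by UserId. The key prefixes and value shapes are hypothetical, not reproductions of the figures.

```python
# One possible keying, roughly per relation (cf. Fig. 1; illustrative only):
kv_instance = {
    "user:user1": {"Name": "John", "Position": "Instructor"},
    "course:COMP9980": {"CourseName": "Computer Architecture"},
    "assigned:user1": {"Courses": ["COMP8150", "COMP9980"]},
    "salary:user1": {"Amount": 46000},
}

# Alternative keying solely by UserId (cf. Fig. 2): every object about the
# user is reachable from one key, with the value's shape decided at PUT time.
kv_by_user = {
    "user1": {
        "Name": "John",
        "Position": "Instructor",
        "Courses": ["COMP8150", "COMP9980"],
        "Salary": 46000,
    },
}

print(kv_instance["salary:user1"]["Amount"])   # 46000
print(kv_by_user["user1"]["Courses"])          # ['COMP8150', 'COMP9980']
```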
Generally, key-value databases provide a flexible yet simple data model with high availability and ease of horizontal partitioning. Major setbacks for key-value databases include the inability to achieve data consistency similar to relational databases (due to the CAP theorem discussed in section "CAP Theorem" of this work). Recent works such as Squirrel [40] and new versions of Amazon DynamoDB [18] and Apache Cassandra [39] attempt to solve the consistency issues of key-value databases using an eventual consistency algorithm [56]. However, support for eventual consistency is currently relatively restricted, as only a few replicated data types are supported. Furthermore, key-value database implementations basically support three query operations: GET (to retrieve data from the database), PUT (to insert data into the database) and DELETE (to remove data from the database) [18, 39], and are usually accessed through an Application Programming Interface (API), a mechanism that enables software systems to communicate with each other, usually through endpoint links. This is a major limitation restricting their applicability, due to the inability to handle complex queries and the requirement for high technical expertise.
Graph-based databases
Several computational problems, such as those in social networks and computer networks, are modelled as graphs due to the connectivity between different instances of the objects in the database. Facebook, for example, stores the friendships between users using its Tao graph database [49, 50]. These are basically modelled as a graph G, which can be represented as a tuple (V, E) where V is a set of vertices and E is a set of edges. Also, |V| and |E| denote the cardinality of the vertices and edges of the graph respectively. In graph databases, the two fundamental data structures used to depict the vertex-to-vertex connectivity (i.e., vertex neighbourhood representations) are the Adjacency Matrix Format [57] and the Adjacency List Format [58].
To simplify our explanation vis-a-vis graph databases data models, merits and limitations of existing graph databases, we provide the Table 5 notations as a summary of symbols used in our explanations of graph databases in this section (section "Graph-Based Databases").
Table 5. Notations used to describe Graph-based NoSQL Database
Notations | Description |
|---|---|
G | A graph G is represented containing a set of vertices V and edges E. |
n | The total number of vertices and edges in a graph G. |
d, Δ | d is the average degree and Δ is the maximum degree in a given graph G. |
P(S) | The power set of the set S, i.e., P(S) = {S′ : S′ ⊆ S}; a set that includes each and every potential subset of S. |
AM, AM[u][v] | AM is the adjacency matrix, while AM[u][v] ∈ {0, 1} is used to represent the adjacency connection between vertices u and v using a 1 or 0 bit. |
AL(u), AL | AL(u) is the adjacency list of the vertex u, and AL is the adjacency list representation. |
LPG | Label Property Graph |
RDF | Resource Description Framework |
ACID | Transaction guarantees for Atomicity, Consistency, Isolation, and Durability. |
The Adjacency Matrix Format uses a |V| × |V| matrix to establish the vertices' connections, i.e., AM[u][v] = 1 if (u, v) ∈ E and 0 otherwise. However, in the AL design, each vertex (say u) has a list of all of its adjacent vertices called its adjacency list AL(u). This adjacency list AL(u) keeps the IDs of all vertices that are adjacent to u. In the implementation of AL, it is stored as an array of vertex IDs. This implies that v ∈ AL(u) if (u, v) ∈ E.
Example 2
Graph database representation with adjacency list. Given the input graph of Figure 3, represented by the adjacency matrix of Figure 4 showing the connections between the vertices of Figure 3, obtain the corresponding graph database representation using an adjacency list.
[See PDF for image]
Fig. 3
Adjacency Matrix Graph
[See PDF for image]
Fig. 4
Adjacency Matrix Table of Vertices
Solution 2: Graph database representation with adjacency list
The corresponding Adjacency List and Edge List (Sorted) solutions are presented in Figures 5 and 6 respectively.
[See PDF for image]
Fig. 5
Adjacency List of Adjacency Matrix
[See PDF for image]
Fig. 6
Edge List of Adjacency Matrix
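Figures 3–6 are only available in the PDF, so the sketch below uses a small assumed adjacency matrix (not the one in Fig. 4) to show how the adjacency-list and sorted edge-list representations referred to in Solution 2 can be derived in Python.

```python
# A small undirected example graph over vertices 0..3 (assumed values).
AM = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

# Adjacency list: AL[u] holds the IDs of all vertices adjacent to u.
AL = {u: [v for v, bit in enumerate(row) if bit] for u, row in enumerate(AM)}

# Sorted edge list: each undirected edge (u, v) listed once with u < v.
edges = sorted((u, v) for u in AL for v in AL[u] if u < v)

print(AL)     # {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(edges)  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```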
In graph database implementations, there are two conceptual data models in use, which are extensions of the adjacency list and edge list presented in Example 2 above. These are the hypergraph model [41] and the Labelled Property Graph model (also known as the Property Graph), which serves as the foundation for the most prominent graph database system, Neo4j [11].
A hypergraph data model is a generalization of the normal graph model (i.e., G(V, E)). In a hypergraph model, any of its edges can connect arbitrary numbers of vertices, which explains why it is rarely used in most recent graph databases and graph processing systems. A hypergraph database model is formally defined as a tuple (V, E), where V is a set of vertices and E ⊆ P(V) \ {∅} (see Table 5 for symbol interpretation). Hence, it contains hyperedges, which are non-empty subsets of V. One of its few uses is in HypergraphDB [41]. HypergraphDB uses the concept of key-values (a dictionary way of storage in which each key is mapped to a corresponding value) as its building block, and each of these values is called an atom in HypergraphDB. Every atom has an ID that is cryptographically generated to enforce distinct IDs for different graph elements in a distributed environment. In a hypergraph, vertices and hyperedges are represented as atoms; this implies that each vertex and hyperedge has its own unique ID in the hypergraph model. Therefore, an atom of a hyperedge normally stores the list of IDs of the vertices connected to it. For each vertex and hyperedge, there is also a data field called typeID or labelID that can be used to store values or additional properties recursively in the graph database, referenced by a value ID (see the hypergraph sample representation in Figure 7). This recursive structure of typeID includes value IDs for binary data or value IDs for other atoms that are identified by other recursive structures.
[See PDF for image]
Fig. 7
HypergraphDB: Using Key-Value to Represent Hypergraph
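As a rough sketch of the atom-based layout described above (Fig. 7 itself is only available in the PDF), the Python dictionary below stores vertices and one hyperedge as atoms keyed by IDs, with the hyperedge atom holding the IDs of the vertices it connects. All IDs, type labels and values are hypothetical.

```python
# Every graph element is an atom identified by a (normally cryptographic) ID.
atoms = {
    "a1": {"kind": "vertex", "typeID": "Person", "value": "John"},
    "a2": {"kind": "vertex", "typeID": "Person", "value": "Smith"},
    "a3": {"kind": "vertex", "typeID": "Person", "value": "Peter"},
    # A hyperedge atom stores the list of IDs of the vertices it connects.
    "e1": {"kind": "hyperedge", "typeID": "Meeting", "targets": ["a1", "a2", "a3"]},
}

def members(edge_id):
    """Resolve a hyperedge atom to the vertex atoms it connects."""
    return [atoms[v] for v in atoms[edge_id]["targets"]]

print([m["value"] for m in members("e1")])  # ['John', 'Smith', 'Peter']
```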
Labelled Property Graph Model (LPG): This is an enriched version of the classical graph model (i.e., G(V, E)) [42]. The LPG model emanates from the limitations of the classical model in solving certain graph data computational problems, such as computing vertex centralities [59, 60], which explains why LPG is the most used data model in most existing graph databases [42]. To build the LPG data model, the classical graph model G(V, E) is supplemented with labels that specify various subsets (or classes) of vertices and edges. Then, each vertex and edge can have any number (1...n) of additional properties (i.e., labels) called attributes (an attribute is a pair (key, value), where the key identifies the property and the value represents its associated value).
Therefore, the Labeled Property Graph model is formally defined as a nine-element tuple (V, E, L, l_V, l_E, K, Z, p_V, p_E), where V represents the vertices, E the edges, L is the set of labels (i.e., attributes), and l_V: V → P(L) and l_E: E → P(L) assign label subsets to vertices and edges. As stated in the summary notation in Table 5, P(L) is the power set of L, hence representing all viable subsets of L. As a result, a subset of labels is assigned to each vertex and edge. Next, any number of attributes can be attached to both a vertex and an edge. Recall that an attribute is represented as a key-value pair; thus, an attribute is a pair (key, value) where key ∈ K and value ∈ Z in the LPG tuple, making K and Z the sets of all possible keys and values respectively. p_V(u) is the set of attribute (i.e., property/label) pairs of the vertex u, while p_E(e) represents the set of attribute pairs of the edge e. Examples of existing graph database benchmark systems using the LPG model include Neo4j [11, 61] and ArangoDB [62].
Example 3
Representation of an instance using the labelled property graph model
Given a simple followership relational database with the relations' schemas Person(Id, Name, Age) and Follow(Source-id, Destination-id, Followership-time), with the instances in Tables 6 and 7, obtain the equivalent graph database representation using the Labelled Property Graph Model.
Table 6. Instance of Person
Id | Name | Age |
|---|---|---|
0001 | John | 35 |
0002 | Smith | 67 |
0003 | Peter | 35 |
Table 7. Instance of Follow
Source−id | Destination−id | Followership−time |
|---|---|---|
0001 | 0002 | 2023:02:01:14:04:01 |
0002 | 0003 | 2023:07:01:14:04:01 |
Solution 3: Representation of instance using labelled property graph model
Figure 8 depicts the representation of the relational schema given in Example 3. The image on the right side of Figure 8 is a representation of the same relational schema using the Resource Description Framework (RDF) triple-store format. RDF provides formats for exploring and displaying related data graphically [63]. The labelled property graph is solely for storing and querying, whereas the RDF model is primarily focused on data exchange.
[See PDF for image]
Fig. 8
Labelled Property Graph Representation of Followership Database
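Because Figure 8 is only available in the PDF, the Python sketch below gives one possible, dictionary-based rendering of the Labelled Property Graph of Example 3: each vertex and edge carries a set of labels and a set of (key, value) attribute pairs taken from Tables 6 and 7. The "Person" and "FOLLOWS" labels are assumptions made for illustration.

```python
# Vertices: label set plus attribute (key, value) pairs from Table 6.
vertices = {
    "0001": {"labels": {"Person"}, "props": {"Name": "John", "Age": 35}},
    "0002": {"labels": {"Person"}, "props": {"Name": "Smith", "Age": 67}},
    "0003": {"labels": {"Person"}, "props": {"Name": "Peter", "Age": 35}},
}

# Edges: directed source -> destination with attributes from Table 7.
edges = [
    {"src": "0001", "dst": "0002", "labels": {"FOLLOWS"},
     "props": {"Followership-time": "2023:02:01:14:04:01"}},
    {"src": "0002", "dst": "0003", "labels": {"FOLLOWS"},
     "props": {"Followership-time": "2023:07:01:14:04:01"}},
]

# Who does John (0001) follow?
follows = [vertices[e["dst"]]["props"]["Name"]
           for e in edges if e["src"] == "0001"]
print(follows)  # ['Smith']
```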
Graph databases are novel in solving computational problems with linked data points that can be modelled using graph theory, and they have gained prominence in this area. However, graph databases suffer from (i) a lack of a standard query language across the different implementations of graph database management systems [64], (ii) an inability to adequately handle transactional (i.e., relational) data [21], (iii) the high cost of implementing distributed graph databases due to the strong connectedness of data points (i.e., nodes), where the high cost of partitions leads to complexity in the horizontal partitioning of graph databases [21, 65], and (iv) high computational and time complexity as the volume of data increases, thus affecting scalability and data retrieval time [21, 66].
Document-oriented databases
Document-oriented databases are a class of NoSQL databases in which data are stored in files, mostly in JSON or XML formats [13, 15, 16–17, 23]. Hence, they are able to handle semi-structured data. Existing document-oriented databases such as BaseX [23], OrientDB [13], ArangoDB [14], Azure Cosmos DB [15], FaunaDB [16] and ElasticSearch [17] store each data object in a document/file (similar to an instance of a relation). This poses an advantage for the elimination of object-relational mapping, since each object is stored in a document. Similar to the key-value databases explained in section "Key Value Databases" of this work, the objects to be stored in the document-oriented database are determined at runtime at the point of insertion. Hence, no strict schema definition is required. However, unlike key-value databases, document databases are designed to support indices and search functionalities (similar to the LIKE keyword of SQL) [67]. Document-oriented databases are a perfect fit for storing application and network logs, as these are exclusively textual data [17]. As a step towards providing data object aggregation functionality, some document-oriented databases such as MongoDB [14] provide a mechanism to group data objects together, called buckets or collections. The bucket functionality allows ease of data replication, persistence and security in groups. Although document-oriented databases depend on the runtime data entry for the data schema, as stated earlier in this section, for simplicity Example 4 provides a relational database with an equivalent instance in a document-oriented database (i.e., in JavaScript Object Notation (JSON) format).
Example 4
Document-oriented database instance representation
Given the relational database relation PARTSUPP with the schema PARTSUPP(Partkey, Suppkey, Supplycost, Availqty), extracted from the TPC-H PART-SUPPLIER database [36], with the relation instance of Table 8, obtain an equivalent instance of the relation in document-oriented database format.
Table 8. Relational Instance of PARTSUPP Relation of TPC-H
Partkey | Suppkey | Supplycost | Availqty |
|---|---|---|---|
1 | 2 | 3325 | 771.64 |
1 | 2502 | 8076 | 993.49 |
2 | 5002 | 3956 | 337.09 |
1 | 7502 | 4069 | 357.84 |
2 | 3 | 8895 | 378.49 |
Solution 4: Document-oriented database instance representation
To represent the PARTSUPP relation in JSON document-oriented databases such as MongoDB [14] and Elasticsearch [17], each attribute of the relation is mapped with its corresponding value as presented in Figure 9.
[See PDF for image]
Fig. 9
Document Instance of PARTSUPP Relation of TCP-H
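Figure 9 is only available in the PDF; as an illustration of the stated mapping, the snippet below renders the first two rows of Table 8 as JSON-style documents whose fields mirror the relation's attributes. The collection layout and exact field spelling are assumptions, not the figure's content.

```python
import json

# Each PARTSUPP row becomes one document; field names mirror the relation's
# attributes and values are taken from the first two rows of Table 8.
partsupp_docs = [
    {"Partkey": 1, "Suppkey": 2,    "Supplycost": 3325, "Availqty": 771.64},
    {"Partkey": 1, "Suppkey": 2502, "Supplycost": 8076, "Availqty": 993.49},
]

print(json.dumps(partsupp_docs[0], indent=2))
```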
To model the link between documents (i.e., data objects), document-oriented database implementations utilize either document embedding or document referencing. By embedding one-to-one or one-to-many relationships in a single document, the client application does not need to combine data across documents; with this, a single operation (GET or SEARCH) can obtain all the data. A one-to-one connection can be represented as a list of attributes within the referencing document. A one-to-many connection, on the other hand, can be modelled by embedding the "one" side in the "many" side or by embedding the "many" side in the "one" side. However, embedding causes data denormalization and duplication, which can lead to data redundancy and consistency issues, particularly in write-intensive settings.
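The contrast between embedding and referencing can be sketched as follows; the document shapes and field names are illustrative and not tied to any specific document store's syntax.

```python
# Embedded one-to-many: the supplier document carries its part entries,
# so a single read returns everything (at the cost of duplication).
supplier_embedded = {
    "_id": "S201",
    "NationKey": "N5",
    "parts": [{"PartKey": "P11", "AvailableQty": 3325, "Cost": 771.64}],
}

# Referenced one-to-many: the supplier stores only the IDs of part documents,
# which must be looked up separately.
supplier_referenced = {"_id": "S201", "NationKey": "N5", "part_ids": ["P11"]}
parts_collection = {"P11": {"PartKey": "P11", "AvailableQty": 3325, "Cost": 771.64}}

print(supplier_embedded["parts"][0]["Cost"])                              # one read
print(parts_collection[supplier_referenced["part_ids"][0]]["Cost"])       # second lookup
```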
Document-oriented databases have high availability and partition tolerance with weaker consistency (i.e., eventual consistency). The operations on document databases include GET, PUT, DELETE and SEARCH. These operations are mostly coupled with programming languages and can be accessed through APIs provided by the document database management system vendors. Hence, their usage mostly requires technical expertise, with no uniform or standardized query language among the different implementations [67].
SQL over NoSQL systems
Works such as Zidian [35], Fast Scans [51], Apache Hive [34] and SparkSQL [33] have been proposed towards enhancing the execution of SQL queries on NoSQL databases. This is done by building an SQL query layer on top of the NoSQL databases (i.e., "SQL over NoSQL"). For instance, Zidian [35] built an SQL layer on a Key-Value NoSQL store by grouping sets of related attributes into blocks and choosing an arbitrary key for a group of related attributes. An example is given below:
Example 5
Transforming a normalized relational database schema to SQL BaaV format using Zidian [35]
Given a normalized relational database schema for the database ProductSupplier, derived from a simplified TPC-H [36] database, with sample instances in Tables 9, 10 and 11, obtain the corresponding ProductSupplier Zidian BaaV SQL schema.
SUPPLIER(suppkey, nationkey),
PARTSUPP(partkey, suppkey, availqty, supplycost),
NATION(nationkey, name)
Table 9. Instance of Supplier
Suppkey | Nationkey |
|---|---|
1 | 17 |
2 | 5 |
3 | 1 |
4 | 15 |
Table 10. Instance of PartSupp
Partkey | Suppkey | Availqty | Supplycost |
|---|---|---|---|
P34 | 2 | 3325 | 771.64 |
P33 | 4 | 8076 | 993.49 |
P12 | 1 | 3956 | 337.09 |
P09 | 3 | 4069 | 357.84 |
P1 | 2 | 8895 | 378.49 |
Table 11. Instance of Nation
Nationkey | Name |
|---|---|
1 | INDIA |
15 | INDONESIA |
5 | IRAN |
17 | IRAQ |
12 | JAPAN |
Solution 5: Transforming a normalized relational database schema to SQL BaaV format using Zidian [35]
To define Zidian BaaV, the following holds (Summary descriptions of notation are provided in Table 22):
Definition 2.1
(Zidian Block as a Value)
A Key-Value (KV) schema is represented as R(X, Y), where X and Y denote sets of attributes. A keyed block over the KV schema R(X, Y) takes the form of a key-value pair (k, B), where k is a tuple over the attributes X and B is a set of tuples over the attributes Y. A Zidian BaaV KV instance (represented as D) of R(X, Y) is a collection of keyed blocks over R(X, Y) that possess unique keys. The degree of D, denoted by deg(D), is determined by the largest size of the keyed blocks (k, B) in D. In other words, deg(D) = max{|B| : (k, B) ∈ D}, where |B| signifies the number of tuples in B.
To create SQL BaaV from a normalized relational schema, Zidian uses the following:
Flatten all the relations to obtain a denormalized relational database schema.
For each relation in the schema, obtain an attribute(s) that can uniquely identify a block of tuples in the relation.
Step 1:
Flatten all the relations to have:
SUPPLIER(suppkey, nationkey),
PARTSUPP(partkey, suppkey, supplycost, availqty),
NATION(nationkey, name)
Step 2:
Loop through the corresponding relations in the ProductSupplier database above and choose an arbitrary key (i.e., attribute(s)) that can identify a block of attributes in each individual relation, giving the Zidian BaaV schemas with the instances in Tables 12, 13 and 14 respectively:
Table 12. Zidian BaaV Instance of Supplier
Suppkey | Nationkey |
|---|---|
1 | 17 |
2 | 5 |
3 | 1 |
4 | 15 |
Table 13. Zidian BaaV Instance of PartSupp
Suppkey | Partkey | Availqty | Supplycost |
|---|---|---|---|
2 | P34 | 3325 | 771.64 |
4 | P33 | 8076 | 993.49 |
1 | P12 | 3956 | 337.09 |
3 | P09 | 4069 | 357.84 |
2 | P1 | 8895 | 378.49 |
Table 14. Zidian BaaV Instance of Nation
Name | Nationkey |
|---|---|
INDIA | 1 |
INDONESIA | 15 |
IRAN | 5 |
IRAQ | 17 |
JAPAN | 12 |
Under a normalized relational database, nationkey, suppkey and name cannot be keys of the corresponding relations since they are not the respective primary keys of the relations when normalized. However, under Zidian BaaV, they are primary keys, because Zidian [35] is not modelled based on relational database normal forms and is not relationally complete. Zidian is modelled on blocks of tuples, with the advantage of retrieving more blocks of data from a key-value database using an SQL layer on the key-value store.
Example 6
Data retrieval query result on a Zidian instance. Given the TPC-H relational database schema of Example 5 and the corresponding Zidian SQL BaaV PartSupp instance of Table 13 with key suppkey, retrieve the supplycost of the supplier with suppkey 2.
Solution 6: Data retrieval query on Zidian instance
Table 15. Result of Supplycost where suppkey equals 2
Suppkey | Supplycost |
|---|---|
2 | 771.64 |
2 | 378.49 |
As observed in the retrieved result of Table 15, suppkey cannot uniquely identify the entries in the block of Zidian's SQL BaaV instance of PartSupp.
From Example 5 above, due to denormalization and the lack of functional dependency (FD) of all the attributes of a block on the chosen key attribute, data integrity violations occur in the blocks formed. This is observed in the PartSupp block of the Zidian BaaV model, where suppkey cannot uniquely identify the block in Table 13. Also, data (block) retrieval time is still limited by relational database functionality as the database grows. Furthermore, Zidian only applies to Key-Value stores. Hence, Zidian is not only relationally incomplete (i.e., it cannot answer all relational algebra and calculus-based queries as obtainable in relational databases), but it is also only applicable to Key-Value databases.
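As a hedged illustration of this behaviour, the Python sketch below groups the PartSupp rows of Table 13 into keyed blocks under suppkey and issues a lookup for suppkey 2; the whole block comes back, reproducing the two-row result of Table 15 and showing that the chosen key does not uniquely identify the tuples inside a block. The data structures are illustrative, not Zidian's implementation.

```python
from collections import defaultdict

# PartSupp rows from Table 13 (suppkey chosen as the block key).
rows = [
    {"suppkey": 2, "partkey": "P34", "availqty": 3325, "supplycost": 771.64},
    {"suppkey": 4, "partkey": "P33", "availqty": 8076, "supplycost": 993.49},
    {"suppkey": 1, "partkey": "P12", "availqty": 3956, "supplycost": 337.09},
    {"suppkey": 3, "partkey": "P09", "availqty": 4069, "supplycost": 357.84},
    {"suppkey": 2, "partkey": "P1",  "availqty": 8895, "supplycost": 378.49},
]

# Build keyed blocks (k, B): each key maps to the set of tuples sharing it.
blocks = defaultdict(list)
for r in rows:
    blocks[r["suppkey"]].append({k: v for k, v in r.items() if k != "suppkey"})

# deg(D) is the size of the largest block.
deg = max(len(b) for b in blocks.values())

# GET supplycost for suppkey 2: the whole block is returned (cf. Table 15).
print([t["supplycost"] for t in blocks[2]])  # [771.64, 378.49]
print(deg)                                   # 2
```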
To the best of our knowledge, there is no existing system for NoSQL over SQL databases, particularly over document-oriented NoSQL databases. However, there are several existing works, such as Zidian [35], Fast Scans [51], Apache Hive [34] and SparkSQL [33], developed for SQL over NoSQL. These systems are able to run some SQL queries (in addition to the typical NoSQL queries of GET, PUT, and DELETE), but they do not completely eradicate other problems associated with NoSQL databases, such as (i) data inconsistency, (ii) lack of normalization, and (iii) the requirement for specific programming language expertise. (iv) Additionally, because these systems rely on replicating data originally stored in NoSQL systems without strict structure into the SQL layer for onward querying, while choosing keys arbitrarily on the relations, the retrieval time is still dependent on the size of the relation's tuples.
In our earlier NOSD system [30], we proposed integrating NoSQL over existing relational databases through data replication in a block format from relational databases into NoSQL databases. This work extends NOSD [30] to enforce data consistency by keeping the integrity of the data replicated from relational databases. This is achieved by UniqueNOSD's NOSDReplica algorithm of section "Introduction", which ensures that the same data blocks (i.e., relations' instances) are not replicated more than once. Additionally, UniqueNOSD extends NOSD [30] by enforcing consistency and data integrity on the NoSQL database (explained in Section "Operations on UniqueNOSD" of this work), improving the query operations that can be performed on the NoSQL database (explained in Section "Operations on UniqueNOSD" of this work) and ensuring that un-updated versions of the same data are not re-replicated with NOSDReplica.
CAP theorem
Recent applications require high availability and scalability (i.e., partition tolerance), resulting in NoSQL database system designers having to decide which two out of the three qualities (Consistency, Availability and Partition Tolerance: CAP) to guarantee [14]. It is commonly accepted that ensuring consistency through the ACID (i.e., Atomicity, Consistency, Isolation and Durability) characteristics is vital in many conventional relational database systems. However, a weaker consistency level is frequently acceptable in a NoSQL distributed database because it is crucial to guarantee the other two qualities (availability and partition tolerance). As a result, weaker consistency levels rather than availability guarantees are frequently adopted in NoSQL systems, and eventual consistency [18] (i.e., when inserted data is not immediately available until replicated) is the type of consistency frequently used in NoSQL systems.
Generally, relational databases enforce consistency using the ACID properties. The CAP theorem, which applies to distributed systems with data replication and to existing NoSQL databases, states that it is impossible to ensure all three of the required properties (consistency, availability, and partition tolerance) at the same time [4]. Figures 10 and 11 below depict how the CAP theorem affects existing database systems and where our system stands.
[See PDF for image]
Fig. 10
CAP Theorem and where UniqueNOSD stands
[See PDF for image]
Fig. 11
CAP Theorem Example
Proposed unique NoSQL over SQL block as a value system(UniqueNOSD)
In this section, we present UniqueNOSD. UniqueNOSD, which comprises three main phases, is presented in the subsequent subsections after presenting the overall component-based architecture of our system.
Architecture of the proposed model: UniqueNOSD
The three main components of UniqueNOSD are (i) Data Replication from the Relational Database (explained with examples in Section "Data Replication from Relational Database to Document Oriented Database"), (ii) Data Mapping using the Block as a Value method (explained with examples in Section "Operations on UniqueNOSD"), and (iii) Query Evaluation and Operation (explained with examples in Section "Join operator"). The system architecture is provided in Fig. 12.
[See PDF for image]
Fig. 12
UniqueNOSD System Architecture
Data replication from relational database to document oriented database
One of the major drawbacks of relational databases is the increase in data retrieval time as the number of tuples in a relation increases [68]. Existing relational database research and database management implementations such as MySQL, MSSQL, and Oracle attempt to address this drawback through attribute indexing; however, this does not completely eliminate the increase in retrieval time complexity as the tuples increase. Therefore, for existing relational database-reliant applications to take advantage of the faster retrieval time associated with NoSQL databases, the first module of UniqueNOSD works as a daemon process that replicates connected relational databases into the NoSQL database. Before persisting the relations into the NoSQL database, the relations are transformed using the NoSQL over SQL strategy of section "Operations on UniqueNOSD". We first present our proposed NOSDReplica algorithm, which creates replica data from the relational database on the NoSQL database node (note: a node is a single point of storage on a NoSQL cluster, while a cluster consists of one or more nodes) for relations with an access rate exceeding a predefined threshold. The threshold is used to distinguish popular relations within the SQL database based on their access history. The input to the NOSDReplica algorithm is the SQL database relations' access history, together with the threshold used to distinguish popular schema tables.
[See PDF for image]
Algorithm 1
NOSDReplica(AH,threshold)
We explain the steps of the NOSDReplica algorithm below, followed by a minimal code sketch of the replication flow and a running example of relational database relation replication.
Steps in the proposed NOSDReplica algorithm
The algorithm's inputs are the tables containing records of access to the different relations in the database (AH) and the threshold used to distinguish popular relations.
To track the creation time of the replicated data, the current time is recorded in t at the beginning of the algorithm's execution.
The relations' access history (AH) is scanned, and the tables whose number of accesses is greater than or equal to the threshold are selected.
L contains the information on the popular relations; these relations are to be replicated on suitable nodes of the NoSQL database.
For each relation r in L, NOSDReplica obtains the associated relationID of the relation in the SQL database and the parent node p of the NoSQL database (lines 4 and 5 of the algorithm).
The while loop of NOSDReplica decides whether the replica data should be created and where, while taking care of the available-space management of nodes in the NoSQL database.
If a replica already exists in the NoSQL database store, there is no need to replicate the relation.
In that case, the creation time of the existing replica (step 7) is updated to the current replication time.
The while loop is then broken, and the next relation in L is processed.
If the replica of f does not exist in p in step 7 above, and the available space in the NoSQL database node is large enough for the relation coming from the SQL module, the function REPLICATE(f, p, t) is invoked to create a new replica of f in p, and the creation time of the new replica is set to t (line 12).
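The following is a minimal Python sketch of the replication flow described in the steps above, under assumed data structures for the access history and the NoSQL nodes; it is not the authors' implementation, and the field and helper names are illustrative. Node 0001 is assumed to already hold relation R1012, as stated in Solution 7 below.

```python
import time

def nosd_replica(access_history, nodes, threshold):
    """Sketch of the NOSDReplica flow: replicate popular relations onto NoSQL nodes.

    access_history: list of dicts with 'rid', 'accesses' and 'size_mb'.
    nodes: list of dicts with 'node_id', 'avail_mb' and 'replicas' (set of RIDs).
    threshold: minimum access count for a relation to be considered popular.
    """
    t = time.time()  # creation/update time for replicas

    # L: popular relations, sorted by number of accesses in descending order.
    popular = sorted((r for r in access_history if r["accesses"] >= threshold),
                     key=lambda r: r["accesses"], reverse=True)

    for rel in popular:
        for node in nodes:                        # parent node is checked first
            if rel["rid"] in node["replicas"]:
                node["last_update"] = t           # replica exists: refresh time only
                break
            if node["avail_mb"] >= rel["size_mb"]:
                node["replicas"].add(rel["rid"])  # REPLICATE(f, p, t)
                node["avail_mb"] -= rel["size_mb"]
                node["last_update"] = t
                break
    return nodes

# Running Example 7 data (Tables 16 and 17), threshold = 11.
history = [
    {"rid": "R1011", "accesses": 13, "size_mb": 25},
    {"rid": "R1012", "accesses": 11, "size_mb": 76},
    {"rid": "R1013", "accesses": 5,  "size_mb": 91},
    {"rid": "R1014", "accesses": 20, "size_mb": 21},
    {"rid": "R1015", "accesses": 10, "size_mb": 67},
]
# Node 0001 assumed to already hold R1012 (per Solution 7).
cluster = [{"node_id": "0001", "avail_mb": 900, "replicas": {"R1012"}}]
print(nosd_replica(history, cluster, threshold=11))  # R1011 and R1014 added, 854 MB left
```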
Example 7
Running example of the extended NOSDReplica algorithm. Given Table 16, containing the access history of five relations of the (simplified) TPC-H database, and Table 17, containing the access history of the nodes of an instantiated NoSQL database with node 0001 as the parent (i.e., primary) node, use the proposed NOSDReplica algorithm to obtain the replicated relations given a threshold of 11 (i.e., the number of recent accesses), where RID is the relational database relation ID.
Table 16. Relations Access History
RID | TableName | NumberOfAccess(S) | Size (MB) |
|---|---|---|---|
R1011 | PARTSUPP | 13 | 25 |
R1012 | SUPPLIER | 11 | 76 |
R1013 | PARTS | 5 | 91 |
R1014 | PRODUCTS | 20 | 21 |
R1015 | NATION | 10 | 67 |
Table 17. Nodes Update History
NodeID | UpdateTime | AvailSpace(MB) | RIDsUpdateNoSQL |
|---|---|---|---|
0001 | 23 | 900 | R1012 |
0002 | 43 | 987 | |
Solution 7: Running example of extended NOSDReplica algorithm
Step 1:
Obtain t, i.e., the last access time of relation R. From Table 16, we create an access sequence for the relations R of the relational database, sorted according to the number of accesses S in descending order.
Step 2:
Obtain L, i.e., the popular relations sorted according to their number of accesses in descending order, given a threshold of 11: select the tables whose number of accesses is greater than or equal to 11, giving L = {R1014, R1011, R1012}.
Step 3:
Loop through L and check whether each relation already exists in (f, p); if it does, update its creation time (i.e., its replication time in the NoSQL store) and continue to the next element of L. For our example, since R1012 already exists in (f, p), it is not replicated; only its last access time is updated. R1011 and R1014, on the other hand, are yet to be replicated, since they are not present in (f, p). The output of Step 3 is therefore {R1011, R1014}.
Step 4:
The available space at the primary node 0001 is checked against the incoming relation R1011; if the size of R1011 is less than the available space, it is replicated; otherwise, the algorithm moves to the next r in L (i.e., R1014). Since SIZE(R1011) = 25 MB and SIZE(R1014) = 21 MB are both less than the available space of 900 MB at node 0001, R1011 and R1014 are replicated, and the available space becomes 875 MB and then 854 MB respectively.
Hence, from L, the relations R1011 and R1014 are replicated. However, before the replicated relational tables are persisted into the NoSQL database, the data is transformed into the BaaV format described in section "Operations on UniqueNOSD".
At a time t, the available space of a replica node is the maximum storage that the particular NoSQL node can provide for new replicas. In the case of the automatic replica node allocation failure, the space must be checked before creating new replicas. For a particular node, the available space is the addition of its free space and the space occupied by the removable replicas . Denoting the available space on a NoSQL parent node as p by AS(p) and the free space at time t as FS(p).
AS(p) = FS(p) + \sum_{r \in \text{removable replicas of } p} SIZE(r) \quad (1)
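As a small illustration of equation (1), the helper below (a sketch; removableReplicaSizesMb is a hypothetical list of the sizes of replicas that could be evicted from node p) simply adds the node's free space to the space held by removable replicas.

```java
import java.util.List;

/** Sketch of equation (1): AS(p) = FS(p) + total size of the removable replicas on p. */
public class AvailableSpaceSketch {
    static long availableSpaceMb(long freeSpaceMb, List<Long> removableReplicaSizesMb) {
        long removable = removableReplicaSizesMb.stream().mapToLong(Long::longValue).sum();
        return freeSpaceMb + removable;
    }

    public static void main(String[] args) {
        // A node with 900MB free and removable replicas of 25MB and 21MB has AS(p) = 946MB.
        System.out.println(availableSpaceMb(900, List.of(25L, 21L)));
    }
}
```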
Using equation (1) above, the function in NOSDReplica obtains the available space of parent node p at the current replication time t to determine the removable replicas. The function is only invoked if the available storage on the destination replica node is greater than or equal to the size of the incoming relation from the relational database.
Data mapping: block as a value (BaaV) on NoSQL
Problem 1: Transforming a relational database to the uniqueNOSD NoSQL BaaV format. Given the relational database PartSupplier with the instances in Table 18, Table 19, Table 20 and Table 21, where each relation has a primary key attribute and a set of attributes, obtain an equivalent document-oriented NoSQL database using Block as a Value (BaaV), denoted as a tuple (K, B) where K means key and B means block: K is a key attribute and B is a set of relations (i.e., r) called blocks, each of which contains a set of its own attributes and is denoted with a key attribute and a set of attributes typical of a relational model, keeping the relations in B related through foreign-key relationships.
Table 18. Instance of Supplier
SuppKey | NationKey |
|---|---|
S101 | N17 |
S201 | N5 |
S103 | N1 |
S104 | N15 |
Table 19. Instance of PartSupp
PartKey | SuppKey | AvailableQty | Cost |
|---|---|---|---|
P11 | S201 | 3325 | 771.64 |
P11 | S104 | 8076 | 993.49 |
P22 | S104 | 3956 | 337.09 |
P11 | S103 | 4069 | 357.84 |
P22 | S103 | 8895 | 378.49 |
Table 20. Instance of Nation
NationKey | Name |
|---|---|
N5 | INDIA |
N1 | INDONESIA |
N17 | IRAN |
N15 | CHINA |
N11 | JAPAN |
Table 21. Instance of Part
PartKey | PartName |
|---|---|
8 | Chair |
9 | Pen |
10 | Shovel |
11 | Tractor |
12 | Fertilizer |
Solution to Problem 1: Transforming the relational database to the uniqueNOSD NoSQL BaaV format. For each relation r in the PartSupplier database, obtain the functional dependencies (FDs) between all attributes in r.
FDs:
Step 1:
Loop through the corresponding relations in the PartSupplier database above and remove all the primary keys, without losing the other attributes in their respective relations and while keeping the foreign-key attributes (i.e., partial database decomposition), to obtain a denormalized (i.e., no longer in 3rd Normal Form) database schema.
The foreign keys are SuppKey, NationKey, and PartKey.
Step 2:
Loop through the corresponding relations in the PartSupplier database from Step 1 above, and choose a key (i.e., the set of foreign-key attributes, without repetition) that can identify a block of attributes for the relations.
(K, B) is obtained, where K and B are as defined above. The key and the components of the block B are defined as follows:
Table 22. Summary of UniqueNOSD Notations
Notation | Definition |
|---|---|
 | Relational Schema (i.e., relations) |
 | Number of tuples in D |
 | A Block as a Value (BaaV) Key-Value database instance |
 | A Block as a Value (BaaV) database instance of UniqueNOSD on a Document-oriented NoSQL database |
R(Z) | Relational Schema under TaaV (i.e., Relational database that follows relational algebra and calculus rules) |
 | Equivalent Key-Value schema under BaaV |
 | Equivalent Document-Oriented NoSQL schema under BaaV of UniqueNOSD |
 | Relational Schema of NoSQL BaaV |
 | Attributes of |
 | Primary key of |
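To make the key-block pairs of the solution above concrete, the sketch below builds one PartSupplier document as nested Java maps and lists. It is a hypothetical, simplified rendering of a single (K, B) pair grounded in the instances of Tables 18-20; an actual deployment would persist an equivalent JSON document in the document store.

```java
import java.util.List;
import java.util.Map;

/** Sketch of a single UniqueNOSD BaaV document (key-block pair) for the PartSupplier example. */
public class BaavDocumentSketch {
    public static void main(String[] args) {
        // K: the chosen key, built from the foreign-key attributes.
        Map<String, Object> key = Map.of("SuppKey", "S103", "NationKey", "N1", "PartKey", "P11");

        // B: the block, holding the foreign-key-related relations together in one document.
        Map<String, Object> block = Map.of(
            "Supplier", Map.of("SuppKey", "S103", "NationKey", "N1"),
            "Nation",   Map.of("NationKey", "N1", "Name", "INDONESIA"),
            "PartSupp", List.of(Map.of("PartKey", "P11", "SuppKey", "S103",
                                       "AvailableQty", 4069, "Cost", 357.84)));

        // The document is the key-block pair (K, B).
        Map<String, Object> document = Map.of("K", key, "B", block);
        System.out.println(document);
    }
}
```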
Given Problem 1 and its respective solution presented above, we formulate Definition 3.1 of the UniqueNOSD BaaV system on Document-oriented NoSQL stores:
Definition 3.1
(Key selection on uniqueNOSD NoSQL BaaV)
Given a Document-oriented schema represented under BaaV as (X, Y), the schema can have a primary key, a subset of its attributes, such that for any of its instances and any Y-tuples associated with the same key (i.e., the same X-tuple), the tuples are distinct on those attributes. As denoted in the solution above, in UniqueNOSD NoSQL BaaV it is not compulsory for the key to be a relational primary key, but the key uniquely identifies every tuple.
Mapping between NoSQL BaaV stores and relational stores
We present a NoSQL over SQL method for our UniqueNOSD BaaV mapping on a document-oriented database as defined in definition 3.2 below:
Definition 3.2
(Mapping between NoSQL BaaV stores and relational stores:)
Given a UniqueNOSD NoSQL instance of schema (X, Y), which is a set of key-block pairs (K, B) of the document-oriented database, each tuple t in a relation r of block B, together with its key k, is referred to as the pair (k, t), which is a tuple of the instance. The relational version D of the instance is the set of all such tuples (i.e., all (k, t) where t is in a relation r of B).
Therefore, D is an instance of the relational schema (X, Y) when the set of B is flattened (i.e., when the key of B is flattened back to its respective normalized primary and foreign keys). On this basis, we provide Definition 3.3.
Definition 3.3
(Mapping between NoSQL BaaV stores and relational stores:) Given a relational schema R, a database D of R, and a document-oriented schema under BaaV, the mapping of D onto the document-oriented schema is considered a document block-as-a-value store of D. For each instance D of R and any NoSQL BaaV schema, if X, Y are attributes in R, then the store includes an instance obtained from D by first projecting D on XY and then grouping by X based on the foreign-key relationships, given that each block contains attributes from the same relational schema.
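Under the assumptions of Definition 3.3, the sketch below (hypothetical helper code, not the UniqueNOSD mapping module) projects a small PartSupp instance on XY and groups the Y-values by the X key, producing key-block pairs; flattening the blocks back out recovers the original tuples, as Definition 3.2 describes.

```java
import java.util.*;

/** Sketch of Definition 3.3: project D on XY, then group the Y-values by the X key. */
public class RelationalToBaavSketch {
    public static void main(String[] args) {
        // A tiny relational instance D of PartSupp(PartKey, SuppKey, AvailableQty, Cost) from Table 19.
        List<Map<String, Object>> partSupp = List.of(
            Map.of("PartKey", "P11", "SuppKey", "S201", "AvailableQty", 3325, "Cost", 771.64),
            Map.of("PartKey", "P11", "SuppKey", "S104", "AvailableQty", 8076, "Cost", 993.49),
            Map.of("PartKey", "P22", "SuppKey", "S104", "AvailableQty", 3956, "Cost", 337.09));

        // X = {PartKey} is the grouping key; Y = {SuppKey, AvailableQty, Cost} are the block attributes.
        Map<Object, List<Map<String, Object>>> baav = new LinkedHashMap<>();
        for (Map<String, Object> t : partSupp) {
            Map<String, Object> yValues = Map.of(
                "SuppKey", t.get("SuppKey"),
                "AvailableQty", t.get("AvailableQty"),
                "Cost", t.get("Cost"));
            baav.computeIfAbsent(t.get("PartKey"), k -> new ArrayList<>()).add(yValues);
        }

        // Each entry is a key-block pair (K, B); emitting every (k, t) pair flattens it back to D.
        baav.forEach((k, b) -> System.out.println(k + " -> " + b));
    }
}
```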
We prove our NoSQL BaaV concepts with the examples below, using a relational algebra operations approach, in the sections that follow.
Operations on UniqueNOSD
We prove our BaaV approach for the following relational algebra database operations: SELECT, PROJECT, and JOIN, and the following relational algebra extension operations: Extension and Shift, proven using the Knowledge-Based Authentication (KBA) approach.
SELECT operation (σ)
Example 8
SELECT Operation (σ) on UniqueNOSD. Given a simplified version of the UniqueNOSD BaaV TPC-H schema and instances for the PartSupplier database relations presented in section "Operations on UniqueNOSD", with the schema:
with the components of the block B defined as:
e.g., SELECT * FROM Supplier WHERE SuppKey = ’S103’;
Solution 8: SELECT operation (σ) on uniqueNOSD
The SELECT operator is similar to the GET operator in NoSQL. Given the NoSQL database instance above, the SELECT operation (implemented as a get operation) is used for retrieving (i.e., selecting) a subset of the tuples according to a given retrieval condition. This operation selects tuples satisfying a given predicate and is expressed as:
\sigma_{\langle \text{condition} \rangle}(r) \quad (2)
where σ is the predicate operator (i.e., SELECT, or GET in the case of our UniqueNOSD NoSQL BaaV), r is the relation (i.e., table), here the document of our UniqueNOSD BaaV, and the subscript is the propositional logic (i.e., the selection condition).
From the given query, we have:
\sigma_{SuppKey = 'S103'}(Supplier) \quad (3)
UniqueNOSD first scans through the document's keys and then scans through the individual components of the block to retrieve:
Table 23. Result of SELECT Query on UniqueNOSD
SuppKey | NationKey |
|---|---|
S103 | N1 |
Using the document key, a SELECT query can retrieve more data objects from UniqueNOSD documents than from relational databases, without entirely denormalizing the database.
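A minimal pure-Java sketch of the σ/GET behaviour described above is given next (illustrative only, holding the Supplier tuples of Table 18 in a block component; the actual UniqueNOSD engine issues the equivalent retrieval against the document store).

```java
import java.util.List;
import java.util.Map;

/** Sketch of SELECT (GET) on a BaaV block: sigma_{SuppKey='S103'}(Supplier). */
public class BaavSelectSketch {
    public static void main(String[] args) {
        // The Supplier component of a block, as stored inside the document (Table 18).
        List<Map<String, String>> supplierBlock = List.of(
            Map.of("SuppKey", "S101", "NationKey", "N17"),
            Map.of("SuppKey", "S201", "NationKey", "N5"),
            Map.of("SuppKey", "S103", "NationKey", "N1"),
            Map.of("SuppKey", "S104", "NationKey", "N15"));

        // Scan the block component and keep the tuples satisfying SuppKey = 'S103'.
        supplierBlock.stream()
            .filter(t -> "S103".equals(t.get("SuppKey")))
            .forEach(System.out::println);   // the matching tuple: SuppKey=S103, NationKey=N1 (Table 23)
    }
}
```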
Project operation (π)
Similar to relational databases, this is used to choose attribute(s) from a relation (i.e., document on UniqueNOSD) while eliminating (i.e., not retrieving as part of the query results) attributes of the relation not mentioned in the query.
Example 9
Project operation (π) on uniqueNOSD
Given the NoSQL BaaV PartSupplier database relations presented in section "Operations on UniqueNOSD", with the instance of the Nation relation, execute the projection on the Name attribute (i.e., π_Name(Nation)).
Solution 9: Project operation (π) on uniqueNOSD
As described under the SELECT operation example in section "SELECT operation (σ)", we have the result presented in Table 24:
Table 24. Result of Project Query on UniqueNOSD
Name |
|---|
INDIA |
INDONESIA |
IRAN |
CHINA |
JAPAN |
For Join operations (presented in the example of Section "Join operator"), UniqueNOSD follows the relational database query execution pattern with fewer scans, since all foreign-key-related relations are implicitly stored in the same document. With this data locality, we can answer bounded queries within a single document.
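The data-locality argument can be seen in the following hypothetical sketch: because the foreign-key-related Supplier and Nation entries already sit in the same block, a Supplier-Nation join needs no second table scan (illustrative code, not the engine's join implementation).

```java
import java.util.Map;

/** Sketch of a join answered inside a single BaaV document thanks to data locality. */
public class WithinDocumentJoinSketch {
    public static void main(String[] args) {
        // One document's block already holds the foreign-key-related relations together.
        Map<String, Map<String, String>> block = Map.of(
            "Supplier", Map.of("SuppKey", "S103", "NationKey", "N1"),
            "Nation",   Map.of("NationKey", "N1", "Name", "INDONESIA"));

        Map<String, String> supplier = block.get("Supplier");
        Map<String, String> nation = block.get("Nation");

        // The 'join' is just a key check within the same document: no scan over a second table.
        if (supplier.get("NationKey").equals(nation.get("NationKey"))) {
            System.out.println(supplier.get("SuppKey") + " is located in " + nation.get("Name"));
        }
    }
}
```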
To further explain query execution on UniqueNOSD, we provide the Extension and Shift relational algebra operations modelled for querying UniqueNOSD. These are queries specific to the key of blocks (i.e., data saved in a document as a whole and not the individual component relations of the block). We use the simplified NoSQL BaaV relations shown in Fig. 13.
[See PDF for image]
Fig. 13
Extension and Shift Operation Tables
Extension operator
Definition 3.4
(Extension operator on uniqueNOSD)
Given a UniqueNOSD BaaV instance of (X, Y), a second instance of (, Z), and the combined attributes XY. The extension of the first instance with the second is achieved by mapping onto a new BaaV instance of (XY, Z). As provided in Table 22 above, the relational version of the instance is used for this mapping.
The extension is also the same as the natural join on the shared attributes in relational algebra. Therefore, the extension operation is obtained by extending the UniqueNOSD BaaV instance with the Z attributes of the second instance, using the values of (XY) as keys (recall that the second instance is of (, Z)). To further simplify Definition 3.4, Example 10 is given below:
Example 10
Extension operation on uniqueNOSD
Given the simplified BaaV instances of Figure 13, obtain the extension operation.
Solution 10: Extension operation on uniqueNOSD
Following the formulation of the Extension operator in Definition 3.4 above, we equate the given instances of Figure 13 to the instance of (X, Y) and the instance of (, Z), obtaining the result shown in Fig. 14.
[See PDF for image]
Fig. 14
Extension Operator Example
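Since the concrete instances of Figs. 13 and 14 are only shown in the figures, the sketch below uses small hypothetical instances to illustrate the Extension idea of Definition 3.4: each (X, Y) pair of the first instance forms an XY key that is looked up in the second instance to pull in its Z value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the Extension operator: extend an (X, Y) instance with Z values keyed by XY. */
public class ExtensionSketch {
    public static void main(String[] args) {
        // First instance of (X, Y): hypothetical SuppKey -> NationKey pairs.
        Map<String, String> delta1 = Map.of("S103", "N1", "S201", "N5");

        // Second instance of (XY, Z): keyed by the combined "SuppKey|NationKey", holding Z = Name.
        Map<String, String> delta2 = Map.of("S103|N1", "INDONESIA", "S201|N5", "INDIA");

        // Extension: for every (x, y) in the first instance, use the combined key xy to fetch Z.
        Map<String, String> extension = new LinkedHashMap<>();
        delta1.forEach((x, y) -> {
            String xy = x + "|" + y;
            if (delta2.containsKey(xy)) {
                extension.put(xy, delta2.get(xy));
            }
        });
        System.out.println(extension);   // e.g. {S103|N1=INDONESIA, S201|N5=INDIA} (iteration order may vary)
    }
}
```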
Shift operator ()
Definition 3.5
(Shift operator ())
Given two BaaV instances, the first of (X, Y) and a second one. The uniqueNOSD shift operation of the first with the second is achieved by redistributing the keys and value attributes of the instance.
Since the shifted attribute set in the given BaaV instance is set as the key, this allows us to perform the set difference and union relational algebra operations on our BaaV model on the document-based NoSQL database. Example 11, provided below, simplifies Definition 3.5 (i.e., the Shift operation) of the UniqueNOSD NoSQL BaaV.
Example 11
UniqueNOSD shift operator ()
Given the relation of Figure 14 and the attribute set A, obtain the shift.
Solution 11: UniqueNOSD shift operator ()
Using Definition 3.5 and equating the provided relation instance of Figure 14 to the instance of (X, Y) and A to XY, we obtain the shifted instance shown in Figure 15.
[See PDF for image]
Fig. 15
Result of Shift Operation
Join operator
Definition 3.6
(Join operator on uniqueNOSD NoSQL BaaV):
Given two UniqueNOSD BaaV instances defined over the common attribute X. The Join operation of the two instances is achieved by mapping the relational versions of the UniqueNOSD instances over (i.e., based on) their common keys X.
Example 12
Join operator on uniqueNOSD NoSQL BaaV
Given the two UniqueNOSD NoSQL BaaV instances below, obtain their Join operation.
Solution 12: Join operator on uniqueNOSD NoSQL BaaV
Using Definition 3.6 above, the join of the two instances produces the result shown in Fig. 16.
[See PDF for image]
Fig. 16
UniqueNOSD Join Operation Result of
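As a complement to Definition 3.6 and Example 12, the sketch below joins two small hypothetical BaaV instances over their common keys X by keeping only the keys present in both and pairing their blocks (illustrative code only).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the BaaV Join: merge the blocks of two instances over their common keys X. */
public class BaavJoinSketch {
    public static void main(String[] args) {
        // Two hypothetical BaaV instances keyed by the same attribute X (here, NationKey).
        Map<String, List<String>> delta1 = Map.of(
            "N1", List.of("S103"),
            "N5", List.of("S201"));
        Map<String, List<String>> delta2 = Map.of(
            "N1", List.of("INDONESIA"),
            "N15", List.of("CHINA"));

        // Keep the keys present in both instances and merge their blocks.
        Map<String, List<List<String>>> join = new LinkedHashMap<>();
        delta1.forEach((x, block1) -> {
            List<String> block2 = delta2.get(x);
            if (block2 != null) {
                join.put(x, List.of(block1, block2));
            }
        });
        System.out.println(join);   // {N1=[[S103], [INDONESIA]]}
    }
}
```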
For deployments on simpler systems, such as applications with a smaller number of potential users, UniqueNOSD incurs an initial additional replication time that is greater than using an ordinary relational or NoSQL database without an SQL layer. This drawback disappears as the system becomes busy with transactions and data replication (see Section "Data Replication from Relational Database to Document Oriented Database"). This trades off storage and initial set-up time for faster query execution and scalability.
Experimental evaluation
In this section, we present the experimental setup and then the result and analysis of our work.
Dataset and implementation details
TPC-H Dataset1: The TPC-H dataset consists of eight normalized (3rd Normal Form (3NF)) relations widely used for the TPC Benchmark (a database benchmark for testing complex decision-support systems), which allows small SQL queries to be evaluated against larger datasets. We populated the TPC-H tables using a dataset generated by DBGen2 with a size of 100GB. There are 61 attributes in the eight tables of TPC-H, and we tested the UniqueNOSD system with TPC-H DBGen's inbuilt 22 queries (see query properties in Table 26). Table 25 provides details of six of the TPC-H relations' instances relative to the different scale factors used. The scale factor is a variable configured in DBGen, and the database is sized according to the scale factor as stated in Table 25. Table 26 provides a summary of the DBGen TPC-H generated queries, and the TPC-H instance specifications are summarized as follows:
The TPC-H database is designed in compliance with the 3rd Normal Form normalization standard.
Performance of TPC-H is evaluated by applying specified database sizes (Table 25), which are generally referred to as "scale factors." Each scale factor represents the data warehouse’s raw data size. We used a scale factor of 100 in our test cases of the different baseline systems that we compared with UniqueNOSD (See Section "Benchmark Systems and Evaluation Metrics"), resulting in a 100GB data size.
Six of the eight tables generated for our TPC-H databases have a proportionate increase in size with regard to the scaling factor and are filled with data that is uniformly distributed, as shown in Table 25. The two remaining tables of TPC-H (i.e., Nation and Region) have a definite number of tuples since the number of Nations and Regions existing is definite.
To demonstrate concurrency, the 22 complex TPC-H DBGen queries and two data refresh processes (insert and delete) are run simultaneously.
The number of concurrent processes grows as the scale factor rises when the scale factor is varied in accordance with Table 25.
Canada Weather Dataset3: This dataset consists of 27 attributes across 13 tables. We bulk-inserted 12 months of Canadian weather data (December 2021 - November 2022).
US Bureau of Transportation Statistics Dataset4: This data consists of US flight statistics; the dataset used consists of 7 tables and 358 attributes, about 35GB in size, covering January 2018 to December 2022.
UK Ministry of Transportation Dataset5: This comprises UK Ministry of Transportation vehicle test data for vehicles passing the MOT observation points from 2009 to 2015. This data has 23 attributes across 3 tables and is 16GB in size.
Table 25. TPC-H Dataset Tables Number of Rows According to ScaleFactor
TPCH Scale (GB) | Table (No of rows) | |||||
|---|---|---|---|---|---|---|
lineitem | orders | partsupp | part | customer | supplier | |
10 | 60M | 15M | 8M | 2M | 1.5M | 0.1M |
20 | 120M | 30M | 16M | 4M | 3M | 0.2M |
50 | 300M | 75M | 40M | 10M | 7.5M | 0.5M |
100 | 600M | 150M | 80M | 20M | 15M | 1M |
200 | 1.2B | 300M | 160M | 40M | 30M | 2M |
1000 | 6B | 1.5B | 800M | 200M | 150M | 10M |
Table 26. TPC-H Query Dataset Properties
Query type | Query properties | Query numbers | Sample query |
|---|---|---|---|
A | The queries are of medium dimensionality and the result is TPC-H scale factor independent. | Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13, Q14, Q16, Q19, Q22 | SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price, SUM(l_extendedprice*(1-l_discount)) AS sum_disc_price, SUM(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price, AVG(l_discount) AS avg_disc, COUNT(*) AS count_order FROM lineitem WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus; |
B | Query dimensionality is high, with few results and many empty cells | Q15, Q18 | SELECT FIRST 100 c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM(l_quantity) FROM customer, orders, lineitem WHERE o_orderkey IN (SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING SUM(l_quantity) > …) AND c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice DESC, o_orderdate |
Table 27. TPC-H Query Dataset Properties Cont
Query type | Query properties | Query numbers | Sample query |
|---|---|---|---|
C | High dimensionality | Q2, Q9, Q10, Q11, Q17, Q20, Q21 | SELECT nation, o_year, SUM(amount) AS sum_profit FROM (SELECT n_name AS nation, YEAR(o_orderdate) AS o_year, l_extendedprice*(1-l_discount) - ps_supplycost*l_quantity AS amount FROM part, supplier, lineitem, partsupp, orders, nation WHERE s_suppkey = l_suppkey AND ps_suppkey = l_suppkey AND ps_partkey = l_partkey AND p_partkey = l_partkey AND o_orderkey = l_orderkey AND s_nationkey = n_nationkey AND p_name LIKE '%…%') AS profit GROUP BY nation, o_year ORDER BY nation, o_year DESC |
Implementation
To prove our concept, we implemented UniqueNOSD and below are the details of our implementation and deployments for experimentation.
We used the Java programming language with the Spring framework to implement all the algorithms of UniqueNOSD. We used three relational database management systems (MySQL, MSSQL, and Oracle) to experiment with the relational database module. We used Elasticsearch [17] and MongoDB [14] to build the document-oriented NoSQL module. In both the NoSQL module and the relational module, we implemented checks to ensure the same relational data is not replicated more than once and is only updated when there is a new version in the relational database, based on our proposed NOSDReplica algorithm. Additionally, we check that the replication is data preserving, particularly with respect to data integrity, to avoid data redundancy issues. The relational-to-BaaV generation and mapping module of UniqueNOSD is platform independent and can be deployed on any relational database given only connection credentials. However, the NoSQL connector is platform dependent, as different document-oriented NoSQL implementations have different data specifications and connection configuration requirements. For the Elasticsearch NoSQL implementation, the number of nodes is specified in the Elasticsearch configuration file (elasticsearch.yml). The indexing module of UniqueNOSD is handled by Elasticsearch's integrated shard indexing engine and its customized Lucene query technology.
Node parallelization on NoSQL
Each index is divided into smaller parts (called shards) in our implementation, and these shards are spread over numerous nodes of our Elasticsearch datastore (NoSQL database). The system is set to survive failures by defining the number of primary shards and replicas in the Elasticsearch configuration file. This preserves data integrity when the primary shard fails and improves NoSQL performance. This is because replica shards handle Elasticsearch search queries.
Autonomous data replication
In our implementation, for our NOSDReplica algorithm to work continuously, we created a module that automatically discovers the database structure/schema and builds its corresponding schema. To achieve this, we used the Java Persistence API6 (JPA) with the Java Spring Framework7. In addition, we created a cron job in Java (i.e., using the Spring Boot scheduler class8). A configuration file for the cron task specifies the time period for data replication/pulling and the target database; for replication, this can be set to any desired time period.
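A minimal sketch of such a cron-driven trigger using Spring's @Scheduled annotation is shown below; the component name, method name, and cron expression are illustrative placeholders rather than the project's actual configuration (in a Spring Boot application, @EnableScheduling would normally sit on a configuration class).

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

/** Sketch of a scheduled background task that periodically triggers the NOSDReplica pull. */
@Component
public class ReplicationSchedulerSketch {

    // Hypothetical schedule: run the relational-to-NoSQL replication pass every 15 minutes.
    @Scheduled(cron = "0 */15 * * * *")
    public void triggerReplication() {
        // In the real system this would invoke the NOSDReplica replication from the relational database.
        System.out.println("Running scheduled relational-to-NoSQL replication pass");
    }
}
```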
Query interfaces and API
This comprises two services, the user query interface and the application programming interface (API), so that other applications can query the NoSQL over SQL system. The framework provides an easy-to-use layer for any corporate application to effortlessly integrate and publish its queries on UniqueNOSD. Elasticsearch is founded on Apache Lucene9, a sophisticated library that includes indexing and search technologies as well as advanced tokenization techniques. Furthermore, our system has a user-friendly data visualization interface. We tested our implementation with the TPC-H queries10 described in Table 26. TPC-H queries q2, q3, q5, q7, q8, q10, q11, q12, q17, q19, and q21 are scan-free but not constrained due to the large datasets.
Benchmark systems and evaluation metrics
We deploy our implementation on a cluster of 5 EC2 m7i-flex.4xlarge instances, each with 64GB memory, 16 vCPUs and 1TB of storage. Each of these instances works as both a storage and a computing node, and we ran each test 5 times and report the average. We evaluated our UniqueNOSD BaaV NoSQL over SQL implementation relative to relational database systems (MySQL, MSSQL and Oracle), with the average reported across the relational databases. We also evaluated against the SQL over NoSQL systems Zidian [35], Apache Hive [34] and SparkSQL [33]. We evaluated UniqueNOSD against the other benchmark systems based on (1) query evaluation time, (2) index construction time and (3) index update time.
Note that we did not use GET on the NoSQL database because it retrieves more relations (i.e., blocks) and would contradict our evaluations of the relational databases and the SQL over NoSQL databases. Furthermore, GET on NoSQL retrieves all foreign-key-related relations (the whole block in the document). Hence, we structure our queries on BaaV to be similar to SELECT, and whenever our queries require joining related tables, we structure the BaaV queries to suit this as well. Deductively and experimentally, UniqueNOSD has a lower retrieval time than any of the existing systems due to data locality (i.e., all foreign-key-related relations are stored in the same document).
Results and analysis
Next, we report the results of our experiments based on the four datasets described in section "Dataset and Implementation Details". The average scores obtained for the metrics of query evaluation time, index construction time (i.e., creating an index on an attribute of a relation), index update time, and update query execution time are presented in Tables 28, 29, 30 and 31 and Figures 17, 18, 19 and 20 for the respective systems.
Comparing the results, UniqueNOSD scales as the size of the data increases while reducing retrieval time. Due to the data locality of BaaV on document-oriented databases, in addition to the Lucene technology on which our document-oriented store is built, UniqueNOSD was able to execute both bounded and scan-free queries faster than existing SQL over NoSQL systems. Furthermore, our system enforces data integrity and consistency without compromising data availability and partition tolerance. From our experimentation, existing SQL over NoSQL systems and ordinary relational databases suffer from an increase in retrieval time as the number of tuples increases, in addition to not being relationally complete (i.e., their operations do not follow all relational algebra rules).
Table 28. Experiment Result on TPC-H Dataset
Avg. query evaluation (Retrieval) time(ms) | Avg. index construction time(ms) | Avg. index update time(ms) | Update query evaluation Time (ms.) | |
|---|---|---|---|---|
Apache Hive | 49.53 | 42.99 | 65.53 | 70.4 |
NOSD | 42.13 | 42.12 | 59.53 | 63.99 |
UniqueNOSD | 32.48 | 32.07 | 55.48 | 63.23 |
Zidian | 47.41 | 43.03 | 64.41 | 68.7 |
SparkSQL | 49.77 | 49.01 | 65.77 | 70.64 |
Relational Database | 49.99 | 50.95 | 99.01 | 89.64 |
On the TPC-H dataset and its corresponding generated queries11 for database performance evaluation, we compared against 5 other systems based on data retrieval time, index-creation query execution time, index-update query execution time, and data-update query execution time.
[See PDF for image]
Fig. 17
Experimental Result on TPC-H Dataset
Table 29. Experiment Result on UK-MOT Dataset
Avg. query evaluation (Retrieval) time(ms) | Avg. index construction time(ms) | Avg. index update time(ms) | Update query evaluation time (ms.) | |
|---|---|---|---|---|
Apache Hive | 3.58 | 3.09 | 5.55 | 3.85 |
NOSD | 2.43 | 2.43 | 4.9 | 3.5 |
UniqueNOSD | 2.19 | 2.01 | 3.8 | 3.02 |
Zidian | 2.96 | 2.96 | 6.6 | 3.82 |
SparkSQL | 3.95 | 3.05 | 8.6 | 3.95 |
Relational Database | 3.91 | 2.99 | 8.9 | 3.99 |
[See PDF for image]
Fig. 18
Experiment Result on UK-MOT Dataset
Table 30. Experiment Result on Canadian Weather Dataset
Avg. query evaluation (Retrieval) time(ms) | Avg. index construction time(ms) | Avg. index update time(ms) | Update query evaluation time (ms.) | |
|---|---|---|---|---|
Apache Hive | 72.11 | 75.49 | 73.9 | 74.72 |
NOSD | 51.02 | 51.83 | 51.98 | 55.83 |
UniqueNOSD | 48.37 | 45.12 | 50.21 | 54.33 |
Zidian | 56.35 | 59.76 | 54.89 | 57.05 |
SparkSQL | 66.29 | 78.77 | 77.65 | 78.76 |
Relational Database | 77.83 | 79.91 | 78.91 | 86.4 |
[See PDF for image]
Fig. 19
Experiment Result on Canadian Weather Dataset
[See PDF for image]
Fig. 20
Experiment Result on US Flight Traffic Dataset
Table 31. Experiment Result on US Flight Traffic Dataset
Avg. query evaluation (Retrieval) time(ms) | Avg. index construction time(ms) | Avg. index update time(ms) | Update query evaluation time (ms.) | |
|---|---|---|---|---|
Apache Hive | 5.38 | 42.53 | 8.37 | 5.55 |
NOSD | 3.01 | 39.13 | 7.61 | 3.9 |
UniqueNOSD | 2.19 | 22.48 | 6.33 | 2.8 |
Zidian | 3.96 | 42.41 | 8.27 | 3.6 |
SparkSQL | 5.95 | 44.77 | 8.53 | 5.16 |
Relational Database | 6.91 | 62.01 | 8.67 | 5.9 |
Conclusion and future work
Because of their capacity to store unstructured or semi-structured and heterogeneous data, shorter data retrieval times, and the ability to grow horizontally with minimal downtime, NoSQL databases have become one of the most often utilized database systems in most big data scenarios, driven by the widespread adoption of big data in industry and research. However, limitations originating from their design restrict their scope of use: the inability to process complex queries and data inconsistencies impede NoSQL usage in highly sensitive scenarios. Existing SQL over NoSQL works attempt to address the querying limitations of NoSQL databases but also suffer from data inconsistency and data integrity issues, and adversely reintroduce the horizontal scalability issues of relational databases.

In this work, we propose a model that enables document-oriented NoSQL databases to store data in a relational block format using the block as a value (BaaV) method. We demonstrated that our model is able to store data relationally and uniquely to ensure integrity and improve data retrieval time. To allow existing companies and solutions that rely on relational databases to use the NoSQL advantages, we provide a module, the NOSDReplica algorithm, that replicates relational data into a document-based NoSQL database in the background. In addition to the UniqueNOSD custom query operations of Extension and Shift, our proposed UniqueNOSD model can handle relational database query operations such as SELECT, PROJECT, UNION and JOIN with faster retrieval times than existing benchmark systems of both relational databases and SQL over NoSQL database systems. The experiments, performed on the universally accepted TPC-H big data benchmark schema and dataset, Canadian atmospheric weather data, and United States flight traffic data, show that our model outperforms existing methods.

For scenarios involving infrequently accessed or not-yet-replicated relations, UniqueNOSD incurs an initial additional execution time for replication (see Section "Data Replication from Relational Database to Document Oriented Database") that is greater than that of an 'ordinary relational database' or 'NoSQL database' without any SQL layer. This limitation only occurs at the initial set-up and is overcome as the system gets busy with transactions (queries) and replication. It can also be avoided, for systems already deployed on relational databases, by completing data replication before directing queries to the UniqueNOSD query interface, thereby trading off storage and initial set-up time for steadily improving query execution time (i.e., availability) and scalability.

In future work, we intend to mine patterns from data stored on the NoSQL database, since most data mining algorithms require a specific data structure which is usually in a relational format. Deploying pattern mining techniques on UniqueNOSD will be of immense benefit to existing recommender systems, particularly in the e-commerce industry, where it can serve as a suitable near-relational NoSQL database replacement. Additionally, with the reduction in retrieval time, saving e-commerce multimedia data such as images and audio on document-based NoSQL will be explored in relation to pattern mining. Lastly, we aim to enhance the background data replication process for replicating data from SQL to NoSQL.
Acknowledgements
This research was supported by the Natural Science and Engineering Research Council (NSERC) of Canada under an Operating grant (OGP-0194134) and a University of Windsor grant.
Author contributions
All Authors equally contributed to this paper. All authors read and approved the manuscript.
Funding
This research was supported by the Natural Science and Engineering Research Council (NSERC) of Canada under an Operating grant (OGP-0194134) and a University of Windsor grant.
Data availability
The four datasets that were used in this study are publicly available for download. The first dataset, ‘Transaction Processing Council H Dataset,’ which was referred to as TPC-H is available for download at https://www.tpc.org/tpch/.
The second dataset, ‘The United Kingdom Ministry of Transportation Vehicle Inspection Dataset,’ referred to as the UK-MOT Dataset, is available for download at https://www.data.gov.uk/dataset/52e1e2ab-5687-489b-a4d8-b207cd5d6767/roadside-survey-of-vehicle-observations
The third dataset: ’Canadian Weather Dataset’, referred to as the Canadian Weather Dataset, is available for download at https://dd.weather.gc.ca/climate/observations/monthly/csv/
The fourth dataset: ’The United State Flight Traffic Dataset’, referred to as the US Flight Dataset, is available for download at https://www.transtats.bts.gov/DL SelectFields.aspx
Declarations
Ethics approval and consent to participate
This research does not involve humans or animals. The datasets used in this research are publicly available online, and the links to them have been provided in the article.
Consent for publication
All authors read and approved the final manuscript.
Competing interests
The authors declare no competing interests.
Acronyms/Abbreviations
Aggregate bottom-up algorithm
The nth attribute of a relation
Block
BaaV: Block as a value
BLOB: Binary large objects
CAP: Consistency, availability and partitioning
JSON: JavaScript object notation
The key of a normalized relational database table
The key to uniqueNOSD BaaV relation/document
Key-value database
NOSD: NoSQL over SQL database
NoSQL: Not SQL or 'Not only SQL'
SQL: Structured query language
OODB: Object-oriented database
OOP: Object-oriented paradigm/programming
UniqueNOSD BaaV relation/document
XML: Extensible markup language
RDF: Resource description framework
1. https://www.tpc.org/tpch/default5.asp
2. https://docs.deistercloud.com/content/Databases.30/TPCH%20Benchmark.90
3. https://dd.weather.gc.ca/climate/observations/monthly/csv/
4. https://www.transtats.bts.gov/DL_SelectFields.aspx
5. https://www.data.gov.uk/dataset/52e1e2ab-5687-489b-a4d8-b207cd5d6767/roadside-survey-of-vehicle-observations
6. https://docs.spring.io/spring-data/jpa/docs/current/reference/html/
7. https://docs.spring.io/spring-framework/reference/
8. https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/scheduling/package-summary.html
9. https://blogs.oracle.com/javamagazine/post/easy-searching-with-elasticsearch
10. https://docs.deistercloud.com/content/Databases.30/TPCH%20Benchmark.90/Sample%20querys.20.xml
11. https://docs.deistercloud.com/content/Databases.30/TPCH%20Benchmark.90/Sample%20querys.20.xml
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Ruflin N, Burkhart H, Rizzotti S. Social-data storage-systems. In: Databases and Social Networks. Athens, Greece: Association for Computing Machinery; 2011. p. 7–12. https://doi.org/10.1145/1996413.1996415.
2. Kuszera EM, Peres LM, Fabro MDD. Toward rdb to nosql: Transforming data with metamorfose framework. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pp. 456–463. Association for Computing Machinery, Limassol, Cyprus (2019). https://doi.org/10.1145/3297280.3299734
3. Aissa MMB, Sfaxi L, Robbana R. Decisional architectures from business intelligence to big data: Challenges and opportunities. In: Proceedings of the 2nd International Conference on Digital Tools & Uses Congress, pp. 1–9. Association for Computing Machinery, Virtual Event, Tunisia (2020). https://doi.org/10.1145/3423603.3424049.
4. Gilbert, S; Lynch, N. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News; 2002; 33(2):51-59.
5. Sadalage, PJ; Fowler, M. NoSQL Distilled; a Brief Guide to the Emerging World of Polyglot Persistence; 2013; Portland, OR, Addison Wesley Professional:
6. Martínez-Bazan N, Muntés-Mulero V, Gómez-Villamor S, Nin J, Sánchez-Martínez M-A, Larriba-Pey J.-L. Dex: High-performance exploration on large graphs for information retrieval. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pp. 573–582. Association for Computing Machinery, Lisbon, Portugal (2007). https://doi.org/10.1145/1321440.1321521
7. Gilbert, S; Lynch, N. Tigergraph: Tigergraph cloud. Database Trends & Applications; 2019; 32, 40. [DOI: https://dx.doi.org/10.1145/564585.564601]
8. Knuth, DE. The Stanford GraphBase: A Platform for Combinatorial Computing; 1993; New York, NY, USA, Association for Computing Machinery:
9. Memgraph Ltd: Connect and analyze complex data pipelines at scale and in real-time (2023). https://memgraph.com/ Accessed 05 July 2023
10. Gupta, P; Mhedhbi, A; Salihoglu, S. Columnar storage and list-based processing for graph database management systems. Proc VLDB Endow; 2021; 14,
11. Webber, J.: A programmatic introduction to neo4j. SPLASH ’12, pp. 217–218. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2384716.2384777
12. Raman, R., van Rest, O., Hong, S., Wu, Z., Chafi, H., Banerjee, J.: Pgx.iso: Parallel and efficient in-memory engine for subgraph isomorphism. In: Proceedings of Workshop on GRAph Data Management Experiences and Systems. GRADES’14, pp. 1–6. Association for Computing Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2621934.2621939
13. OrientDB: OrientDB (2023). https://orientdb.org/docs/3.0.x/misc/Overview.html Accessed 05 July 2023
14. Mavrogiorgos, K., Kiourtis, A., Mavrogiorgou, A., Kyriazis, D.: A comparative study of mongodb, arangodb and couchdb for big data storage. In: Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing. ICCBDC ’21, pp. 8–14. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3481646.3481648
15. Paz, J; Guay, R. Microsoft Azure Cosmos DB Revealed: A Multi-Modal Database Designed for the Cloud; 2018; 1 USA, Apress:
16. FaunaDB: FaunaDB: Let developers fly (2023). https://fauna.com/ Accessed 06 July 2023
17. Gheorghe, R; Hinman, ML; Russo, R. Elasticsearch in Action; 2015; 1 USA, Manning Publications Co.:
18. DeCandia, G; Hastorun, D; Jampani, M; Kakulapati, G; Lakshman, A; Pilchin, A et al. Dynamo: Amazon’s highly available key-value store. ACM SIGOPS Oper Syst Rev; 2007; [DOI: https://dx.doi.org/10.1145/1323293.1294281]
19. Rajesh N, Hans F, Steven G, Marc K, Herman L, Harry CL, Ryan M, Mike P, Daniel P, Paul S, David S, Tony T, Venkateshwaran VFI. Scaling memcache at facebook. In: 10th USENIX Symposium on Networked Systems Design and Implementation. NSDI ’13 (2013). https://research.facebook.com/publications/scaling-memcache-at-facebook/
20. Chang, F; Dean, J; Ghemawat, S; Hsieh, WC; Wallach, DA; Burrows, M et al. Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst; 2008; [DOI: https://dx.doi.org/10.1145/1365815.1365816]
21. Kolomičenko V, Svoboda M, Mlýnková IH. Experimental comparison of graph databases. In: Proceedings of International Conference on Information Integration and Web-Based Applications & Services. IIWAS ’13, pp. 115–124. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2539150.2539155
22. de Lima C, dos Santos Mello R. A workload-driven logical design approach for nosql document databases. In: Proceedings of the 17th International Conference on Information Integration and Web-Based Applications & Services. iiWAS ’15. Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2837185.2837218
23. Panzeri E, Pasi G. Flex-basex: An xml engine with a flexible extension of xquery full-text. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’13, pp. 1083–1084. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2484028.2484216
24. MongoDB Atlas: Query with Atlas SQL (2025). https://www.mongodb.com/docs/atlas/data-federation/query/query-with-sql/ Accessed 09 March 2025
25. Corbett, JC; Dean, J; Epstein, M; Fikes, A; Frost, C; Furman, JJ et al. Spanner: Google’s globally distributed database. ACM Trans Comput Syst; 2013; [DOI: https://dx.doi.org/10.1145/2491245]
26. Linskey PC, Prud’hommeaux M. An in-depth look at the architecture of an object/relational mapper. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. SIGMOD ’07, pp. 889–894. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1247480.1247581
27. Ghandeharizadeh S, Mutha A. An evaluation of the hibernate object-relational mapping for processing interactive social networking actions. In: Proceedings of the 16th International Conference on Information Integration and Web-Based Applications & Services. iiWAS ’14, pp. 64–70. Association for Computing Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2684200.2684285.
28. Mazumdar, S; Seybold, D; Kritikos, K; Verginadis, Y. A survey on data storage and placement methodologies for cloud-big data ecosystem. J Big Data; 2019; 6, [DOI: https://dx.doi.org/10.1186/s40537-019-0178-3] 15.
29. OrientDB: OrientDB SQL (2025). https://orientdb.com/docs/3.0.x/gettingstarted/Tutorial-SQL.html Accessed 09 March 2025
30. Gidado, A., Ezeife, C.I.: Maximizing bigdata retrieval: Block as a value for nosql over sql. In: Proceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ASONAM ’22, pp. 556–563. IEEE Press, Istanbul, Turkey (2023). https://doi.org/10.1109/ASONAM55673.2022.10068692
31. Muła, W; Lemire, D. Faster base64 encoding and decoding using avx2 instructions. ACM Trans Web; 2018; [DOI: https://dx.doi.org/10.1145/3132709]
32. Crane, M., Lin, J.: An exploration of serverless architectures for information retrieval. In: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. ICTIR ’17, pp. 241–244. Association for Computing Machinery, ??? (2017). https://doi.org/10.1145/3121050.3121086.
33. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark sql: Relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383–1394. ACM, Melbourne, Victoria, Australia (2015)
34. Camacho-Rodríguez, J., Chauhan, A., Gates, A., Koifman, E., O’Malley, O., Garg, V., Haindrich, Z., Shelukhin, S., Jayachandran, P., Seth, S., Jaiswal, D., Bouguerra, S., Bangarwa, N., Hariappan, S., Agarwal, A., Dere, J., Dai, D., Nair, T., Dembla, N., Vijayaraghavan, G., Hagleitner, G.: Apache hive: From mapreduce to enterprise-grade big data warehousing. In: Proceedings of the 2019 International Conference on Management of Data. SIGMOD ’19, pp. 1773–1786. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3299869.3314045.
35. Cao, Y; Fan, W; Yuan, T. Block as a value for SQL over NoSQL. Proc VLDB Endow; 2019; 12,
36. Benchmark TPPC. Transaction Processing Performance Council Benchmark-H (2023). https://www.tpc.org/tpch/ Accessed 06 August 2023
37. Software D. TPCH Benchmark (2023). https://docs.deistercloud.com/content/Databases.30/TPCH%20Benchmark.90 Accessed 2023-10-20
38. Tang, M; Lee, B-S; Yeo, C-K; Tang, X. Dynamic replication algorithms for the multi-tier data grid. Future Gener Comput Syst; 2005; 21,
39. Cassandra A. Apache Cassandra, Open Source NoSQL Database, Manage massive amounts of data, fast, without losing sleep. Available online, accessed 2023-05-10 (2023). https://cassandra.apache.org/index.html
40. De Porre K, Gonzalez Boix E. Squirrel: An extensible distributed key-value store. In: Proceedings of the 4th ACM SIGPLAN International Workshop on Meta-Programming Techniques and Reflection. META 2019, pp. 21–30. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3358502.3361271.
41. Iordanov, B.: Hypergraphdb: A generalized graph database. In: Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 25–36. Springer, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16720-1_3
42. Angles, R; Arenas, M; Barceló, P; Hogan, A; Reutter, J; Vrgoč, D. Foundations of modern query languages for graph databases. ACM Comput Surv; 2017; [DOI: https://dx.doi.org/10.1145/3104031]
43. Guagliardo, P; Libkin, L. A formal semantics of SQL queries, its validation, and applications. Proc VLDB Endow; 2017; 11,
44. Negri, M; Pelagatti, G; Sbattella, L. Formal semantics of sql queries. ACM Trans Database Syst; 1991; 16,
45. Bakkum P, Skadron K. Accelerating sql database operations on a gpu with cuda. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. GPGPU-3, pp. 94–103. Association for Computing Machinery, New York, NY, USA (2010). https://doi.org/10.1145/1735688.1735706
46. Chattopadhyay, B; Lin, L; Liu, W; Mittal, S; Aragonda, P; Lychagina, V et al. Tenzing a SQL implementation on the mapreduce framework. Proc VLDB Endow; 2011; 4,
47. Kohn A, Leis V, Neumann T. Building advanced sql analytics from low-level plan operators. In: Proceedings of the 2021 International Conference on Management of Data. SIGMOD ’21, pp. 1001–1013. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3448016.3457288
48. Emani KV, Ramachandra K, Bhattacharya S, Sudarshan S. Extracting equivalent sql from imperative code in database applications. In: Proceedings of the 2016 International Conference on Management of Data. SIGMOD ’16, pp. 1781–1796. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2882903.2882926
49. Armstrong TG, Ponnekanti V, Borthakur D, Callaghan M. Linkbench: A database benchmark based on the facebook social graph. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. SIGMOD ’13, pp. 1185–1196. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2463676.2465296.
50. Venkataramani V, Amsden Z, Bronson N, Cabrera III G, Chakka P, Dimov P, Ding H, Ferris J, Giardullo A, Hoon J, Kulkarni S, Lawrence N, Marchukov M, Petrov D, Puzar L. Tao: How facebook serves the social graph. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD ’12, pp. 791–792. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2213836.2213957.
51. Pilman, M; Bocksrocker, K; Braun, L; Marroquín, R; Kossmann, D. Fast scans on key-value stores. Proc VLDB Endow; 2017; 10,
52. Kunchithapadam K, Zhang W, Ganesh A, Mukherjee N. Oracle database filesystem. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. SIGMOD ’11, pp. 1149–1160. Association for Computing Machinery, New York, NY, USA (2011). https://doi.org/10.1145/1989323.1989445
53. Tomasic A, Garcia-Molina H. Caching and database scaling in distributed shared-nothing information retrieval systems. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 129–138. ACM, Washington, D.C., USA (1993)
54. Robertson, S.E., Kanoulas, E.: On real-time ad-hoc retrieval evaluation. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1119–1120. ACM, Portland, Oregon, USA (2012)
55. Chaparro, O., Marcus, A.: On the reduction of verbose queries in text retrieval based software maintenance. In: Proceedings of the 38th International Conference on Software Engineering Companion. ICSE-C, pp. 716–718. IEEE/ACM, Austin, Texas (2016)
56. Vogels, W. Eventually consistent. Commun ACM; 2009; 52,
57. Kirkpatrick D. Determining graph properties from matrix representations. In: Proceedings of the Sixth Annual ACM Symposium on Theory of Computing. STOC ’74, pp. 84–90. Association for Computing Machinery, New York, NY, USA (1974). https://doi.org/10.1145/800119.803888
58. Zhu, X; Feng, G; Serafini, M; Ma, X; Yu, J; Xie, L et al. Livegraph: a transactional graph storage system with purely sequential adjacency list scans. Proc VLDB Endow; 2020; 13,
59. Gupta, P; Mhedhbi, A; Salihoglu, S. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology; 2001; 25,
60. Haghir Chehreghani M, Bifet A, Abdessalem T. Adaptive algorithms for estimating betweenness and k-path centralities. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. CIKM ’19, pp. 1231–1240. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3357384.3358064
61. Neo4j: RDF Triple Stores vs. Labeled Property Graphs (2023). https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/ Accessed 06 July 2023
62. ArangoDB: ArangoDB (2023). https://www.arangodb.com/docs/stable/data-models.html Accessed 07 July 2023
63. Ali, W; Saleem, M; Yao, B; Hogan, A; Ngomo, A-CN. A survey of rdf stores & sparql engines for querying knowledge graphs. VLDB J; 2021; 31,
64. Barceló Baeza P. Querying graph databases. In: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. PODS ’13, pp. 175–188. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2463664.2465216.
65. Zhou, X; Li, G; Feng, J; Liu, L; Guo, W. Grep: a graph learning based database partitioning system. Proc ACM Manag Data; 2023; [DOI: https://dx.doi.org/10.1145/3588948]
66. Vaikuntam A, Perumal VK. Evaluation of contemporary graph databases. In: Proceedings of the 7th ACM India Computing Conference. COMPUTE ’14. Association for Computing Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2675744.2675752
67. Davoudian, A; Chen, L; Liu, M. A survey on nosql stores. ACM Comput Surv; 2018; [DOI: https://dx.doi.org/10.1145/3158661]
68. Dang, EKF; Luk, RWP; Allan, J. Fast forward index methods for pseudo-relevance feedback retrieval. ACM Trans Inf Syst; 2015; [DOI: https://dx.doi.org/10.1145/2744199]
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).