Content area
In this paper, we are concerned with data pertinent to transportation networks, which model situations in which objects move along a graph-like structure. We assume that these networks are equipped with sensors that monitor the network and the objects moving along it. These sensors produce time series data, resulting in sensor networks. Examples are river, road, and electricity networks.
Geographical information systems are used to gather, store, and analyse data, and we focus on these tasks in the context of data emerging from transportation networks equipped with sensors. While tailored solutions exist for many contexts, they are limited for sensor-equipped networks at this moment. We view time series data as temporal properties of the network and approach the problem from the viewpoint of property graphs. In this paper, we adapt and extend the theory of the existing property graph databases to model spatial networks, where nodes and edges can contain temporal properties that are time series data originating from the sensors. We propose a language for querying these property graphs with time series, in which time series and measurement patterns may be combined with graph patterns to describe, retrieve, and analyse real-life situations. We demonstrate the model and language in practice by implementing both in Neo4j and explore questions hydrology researchers pose in the context of the Internet of Water, including salinity analysis in the Yser river basin.
1 Introduction
Transportation networks are a common research subject. These networks range from river networks that transport water and road networks that transport vehicles to heat or electricity networks that transport energy. A transportation network is characterised by having a stable topology – that is, the connectivity between nodes does not change often – and objects or substances move through the topology. The objects or substances are measured or tracked by some sort of sensor, and these sensors create time series data. Using graphs to model transportation networks is a common practice based on the connection information often arising from vector-based Geographical Information System (GIS) data. Because these sensors themselves do not move, the time series can be attributed to the nodes and edges representing the vector elements. In the networks, there are also static data, which can be represented as common properties in the graph, such as the IDs of the vector elements. As a result, researchers have to process property graphs with time series in the nodes and edges. This is different from the field of temporal graphs because nodes and edges are always considered present here. Only the properties can change through time, in which case a property is considered to have different values at different points in time. In contrast, most work about temporal graphs assumes that nodes and edges can be removed or added, resulting in graphs where nodes and edges are linked to validity time intervals.
With the rise of the Internet of Things, the number of sensors in transportation networks has grown considerably, and the rate at which these sensors produce data is increasing too . Therefore, the time series can have a high resolution of data points. Nodes and edges are well-supported in graph databases, and query languages exist to process them. However, managing time series in graph databases is not trivial. Additional data structures are required to store the time series, and the existing query languages do not support time series to express constraints . Similarly, time series databases are well-established with their query languages to deal with the timestamps and values. However, adding the graph structure of the network is not supported. A system that fully integrates and supports both graph structures and time series data is missing.
In this work, we use an existing model for property graphs to create a property graph model that includes time series in nodes and edges. We leverage pattern matching, commonly used in the current graph language, to create a query language and logic that can answer queries over the proposed graph model. One crucial part is that, in the query language the temporal properties, with their timestamp–value pairs, can interact with the nodes, edges, and their static properties. The natural order present in the time series is built into the query language to simplify writing constraints in the queries. This provides tools for studying and transforming the data for advanced analysis. The proposed logic is well-suited to objectively describe and study problems or shortcomings in the model or query language. Next to the theory in this work, the usage of the model and implementation is demonstrated on a practical use case within the Internet of Water project where electrical conductivity data are analysed for the Flemish Yser river.
A very preliminary version of the theory, presented in this paper, was published in , without a theoretical framework and implementation. We want to use the present work to develop the theoretical foundation and to show how it can be implemented. In addition, we discuss experiments and additional examples here.
In Sect. , we discuss the related work, and in Sect. the theoretical model for property graphs with time series is given. Section defines the logic for the proposed query language and shows how this language can be realised using the Graph Pattern Matching Language (GPML). The section ends with some application examples. In Sect. , we describe our implementation of the graph model and query language using Neo4j and Cypher together with an experiment based on the Internet of Water project. Sections and discuss the proposed model and query language and present our conclusion.
2 Existing work
There is a long-lasting overlap between geographical information systems and the field of data management, especially if physical networks are studied. Our sensor networks closely align with the definition of “geosensor networks” by . She points out that data are often handled by domain scientists who need to work with the data, preferably without needing to know the lower-level details of data management. This is also demonstrated in practice by the research of for river basins, for traffic on road networks, and for electricity networks, where each network is a transportation network in our definition and is modelled using a graph. Especially in the case of studying rivers, a lot of work is happening in Belgium, where climate change has a big impact because of the increasing occurrences of extreme meteorological events . It is, therefore, not surprising to see parties involved in handling rivers joining forces and investing in full-stack approaches encompassing measuring, storing, analysing, and managing data, as in the Internet of Water project (
In recent years, two types of graphs, resource description framework (RDF) graphs and property graphs, have come to the foreground. For the latter, the recent definition used is the definition given by . give a slightly different definition of property graphs. Before that, other versions of graphs were used: for example, the weighted graph . use this model and define time-aggregated graphs. In a time-aggregated graph, the weight of nodes and edges can vary over time. An example is the travel time in a network. This is part of dynamic, or temporal, graphs where the study focuses on nodes and edges that exist during a certain time . There are also advanced systems, for example GRADOOP by , that incorporate scaling and many interaction possibilities. This field is still an active research domain, as the recently started project HyGrpah (
One of the first publications regarding the underlying logic or calculus for graph query languages is by from 1987 regarding a query language on graphs with recursion. In this work, the relational model is used to store data. An approach based on a graph model is called regular path expression and is described by . These ideas have been extended to regular path queries in tandem with the evolution of the graph model itself . Since then, different papers have been published that further build on regular path queries. show how this all fits together in the query languages. In their vision, the two basic elements of the graph query language are graph patterns and navigational patterns. The evolution is summarised by as the existence of regular path queries (RPQs) and conjunctive graph queries (CQs), which, when taken together, make u the language of conjunctive regular path queries (CRPQs). Adding disjunction to CRPQ results in a more expressive query language called union conjunctive regular path queries (UCRPQs). When considering edge labels in two directions, the theory also talks about two-way regular path queries (2RPQs), conjunctive two-way regular path queries (C2RPQs), and union C2RPQs . However, these language classes do not account for reasoning over the properties of graphs. To accommodate this, the class of regular property graph queries can be used based on regular property graph logic and regular property graph algebra .
Despite the history, practical query languages have only been implemented in recent years. In contrast to early graph interaction, which was mainly imperative API-like, recent query languages are declarative (API: Application Programming Interface). The most important existing languages are Cypher (OpenCypher and GQL), SPARQL, Gremlin, and GSQL. There is an effort to standardise the graph query languages, as was done for SQL. This new standard, called GQL, incorporates elements from (open)Cypher, G-core, PGQL, and Tigergraph's GSQL (
For time series, declarative time series query languages and imperative query approaches exist. An elaborate theory, as in the case of the graph databases, does not exist to the best of our knowledge. The first versions of the time series database InfluxDB (
3 A data model for property graphs with time series
The property graph model is well-suited to model transportation networks, and various definitions of this model exist: for example, the property graph model definition of and the definition of . In this work, the latter is used because it aligns better with our applications. Our definition of the “property graph with time series” model adapts the definition of Bonifati et al. in the book Querying Graphs () and adds time series based on the idea of .
Definition 3.1. Given are a set of values , a finite set of labels , and a set of timestamps (equipped with a total (temporal) order ). Further, we assume that a set of property keys is given, with two disjoint subsets (S stands for static) and (T stands for temporal). The sets , , , and are assumed to be pairwise disjoint. Lastly, the set denotes the set of measurements which are pairs.
A property graph with time series is then defined to be a structure where
-
is a finite set of nodes;
-
is a set of directed edges, where each edge belongs to ;
In the original work of , these two sets are disjoint subsets of a bigger object set. We simplify this approach by not considering this object set. In addition, our directed edges are defined by an ordered pair of nodes because of which the function is not needed.
-
is a partial function assigning to nodes and edges a finite set of labels (here, denotes the set of all finite subsets of );
-
is a partial function assigning a value to a static property of nodes and edges; and
-
is a partial function assigning a finite set of measurements to a temporal property of nodes and edges. We require that in such a set no timestamp appears twice, and we call the image of the function a time series.
Now, we introduce our simplified fictitious running example of such a graph, which is used throughout this paper.
Example 3.1. Our example of a property graph with time series contains seven nodes that all have a numeric identifier from to , and these make up the set of nodes . In addition, there are edges that represent the connectivity of the network. The connectivity is chosen to be unidirectional to reflect a river-like transportation network. There are six edges in total, being , and . A visual representation is shown in Fig. . Each edge is identified by a pair of node identifiers. All nodes and edges can have zero or multiple labels to categorise them. In this example, we label all nodes with the label
We remark that, in this example, the timestamps are behaving perfectly. That is, timestamps are spread evenly, and all time series use the same timestamps. This does not need to be the case, and in reality it probably will not be the case. However, the example is easier to understand by assuming these characteristics.
The complete formal description of the example is as follows, with the remark that in the time series, the timestamps are on 15 August 2022, but only the time is shown for readability.
-
;
-
;
-
, , , , , , , , , , , , ;
-
, , , , , , , , , , , , ;
-
, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , .
This concludes our example, a visual representation of which is shown in Fig. .
Figure 1
Visual graph representation of Example 3.1. The nodes are assigned a label
[Figure omitted. See PDF]
The total temporal order on the set , assumed by Definition 3.1, induces a total order on the measurements in a time series. Indeed, the temporal order on induces an order (which we also denote by ) on , as follows: for , we define if and only if , where is the projection on the time component (that is, returns the timestamp of a measurement). Furthermore, we define to be the value of the measurement (that is, its second component). In other words, a measurement is the couple The above order allows us to define a “next” relationship and a “previous” relationship between measurements in a time series. The definition of both is as follows. Given two measurements and in a time series , is the next measurement after , denoted if and only if , and , and there is no measurement in with (here, has the obvious meaning of but not equal). The definition for the previous measurement, , then corresponds to . In the case that the second condition is dropped (that is, there might be another measurement between the other two), then we use predicates to indicate that occurred after and for the other case. In addition, we can extend this idea to define the first and last measurements. The and predicates are true for a measurement , within a time series , where there is no previous or next measurement for in , respectively.
4 A formal query language for property graphs with time series4.1 Extending regular property graph logic
The concept of regular path queries is well-established and is an important concept (; ; and ). For example, Cypher and GQL are based on the theory of regular property graph queries, described by . These queries can be described using regular property graph logic or regular property graph algebra. To obtain a query language for property graphs with time series, we extend the regular property graph logic with elements to incorporate the time series. More specifically, we add the description of time series patterns for a temporal property in a node or an edge. The result is that the temporal properties are handled just like the nodes, edges, and properties. A query can contain node, edge, path, static property, and temporal property constraints that can be evaluated simultaneously. These constraints may depend on each other without the need for multiple queries.
We take the syntax and semantic definitions of regular property graph logic by and with it define our logic for property graph query language with time series, or GQL-TS logic. This is done by adding formal descriptions for temporal properties and querying them. We show how these logical elements can be realised as time series and measurement patterns in the Graph Pattern Matching Language. In the last part, example queries are given to demonstrate the different possibilities of our graph query language with time series, which we abbreviate as GQL-TS. The entire definition of regular property graph language is also included in this section. We add to it elements that deal with temporal properties – that is, time series – and how these are concatenated with the constraints. For the syntax and the semantics, the existing theory is first given, and we subsequently introduce the new elements added by us.
4.1.1 Syntax of GQL-TS logic
In this section, we give the syntactical definition of GQL-TS logic.
A query, expressed in GQL-TS logic, is conceived to take a property graph with time series (as presented in Definition 3.1) as input, and it takes the form of a set of non-recursive datalog-style rules. Each rule is a description of a sub-graph pattern or path of the input graph and is of the form
1 which has a head predicate and, in the body, a sequence of body predicates, possibly in combination with additional constraints.
First, we need some additional notions and notations. We use to denote the (possibly infinite) set of variables that represent nodes in a graph. Similarly, we have the set of variables that refer to edges in a graph. Finally, we assume an infinite set of fresh predicate names of arity 2.
We define the components of a query rule in GQL-TS logic. First, we define the form of the “head” of a rule, which either creates a new relation (called ) or a result relation of some arity .
Definition 4.1. The head of rule () either has the form where , , or the form with and . In the first case, we define , and in the second case, we define .
Next, we define the form of the “body” components of a rule, which either uses an existing or previously created edge relation (called ), computes the transitive closure of such an edge relation (denoted by ), or uses an existing unary node relation (called ).
Definition 4.2. Each in rule () has the form
-
,
-
, or
-
,
Next, we define the form of the “constraint” components of a rule and distinguish between “static constraints” and “time series constraints”. In this definition, we need a set of measurement variables that range over measurements. When , then is a variable that ranges over values in and is a variable that ranges over timestamps in . Finally, we have a set of binary operators (or relations) on the set of values and a set of binary operators on , to which the total order on timestamps is assumed to belong. Typically, we take and to be .
Definition 4.3. Each in rule () is either a static constraint or a time series constraint.
(1) Static constraints are of the form
-
,
-
, or
-
,
(2) Time series constraints are of the form
-
, with ;
-
, with and ;
-
; or
-
,
-
, , and are
-
values from ;
-
of the form , with ; or
-
the result of a function (with co-domain in ) applied to value and time constants as well as variables.
-
-
, , and are
-
timestamps from ;
-
of the form , with ; or
-
the result of a function (with co-domain in ) applied to value and time constants and variables.
-
We are ready to give the definition of a rule in GQL-TS logic.
Definition 4.4. A GQL-TS logic rule has the form with and , where head, , and are as defined in Definitions 4.1, 4.2, and 4.3, with the additional restriction that
A query expressed in GQL-TS logic is then simply described by a (non-recursive) collection of GQL-TS logic rules. At least one rule in a query needs to have a special head
Definition 4.5. A GQL-TS logic query is a finite non-empty and non-recursive set of rules such that at least one rule has a head predicate
When the predicate
Example 4.1. To make this more tangible, two example queries are now discussed. Take the graph in Example 3.1: on it, we want to find all nodes with the same temperature at the same moment in time as node N1. In other words, we need a query that expresses two node patterns where one matches node N1 and the other a node with the same value at the same moment for the temporal property “temperature”. First the head of the rule is given. It states that the query will return all matches for the variables , , and , for which additional criteria are defined. Starting with the body predicates, and , match a node with the label “point” to the node variables and . The node's name matched to needs to be equal to N1, which is expressed by the constraint . This last part is already a constraint as given in Definition 4.3. The next four constraints focus on the temporal properties, with the first two matching a measurement and in the temporal node property temperature in node and , respectively. These measurements need to have the same value at the same moment in time, expressed by the constraints and . Finally, the constraint is needed to prevent the query from matching the same node to and . This is done with the name in this case but can be realised with any node property that can differentiate any two nodes.
Edges and the temporal properties of those are the focus of the second example query. In the example graph, we want to find two consecutive edges where the travel time on the second edge (in direction) is lower than the travel time on the first edge. This formal query example includes two edge variables in the header: and . Here, two different body predicates are used, and , that match edges with the label “path”. These predicates also explicitly link four node variables to the nodes that make up the edges. In this example, we named these variables , , , and . The constraints can be written using these variables, defining the requirements to which the matches must adhere. First, the constraint ensures that the two edges are consecutive because the end node of the first edge needs to be the same as the start node of the second edge. Secondly, constraints and again match a measurement in each edge's temporal property “travel time”. These measurements have to adhere to the constraints expressed on the last line: and , expressing the requirements for the values and timestamps.
4.1.2 Semantics of GQL-TS logicIn this section, we give the semantics of GQL-TS logic queries when applied to a property graph with time series. Two parts form the following semantics: (a) the bodies and constraints of rules are satisfied by an assignment, and (b) each valid assignment contributes elements to a set defined by the head of a rule. The first part will be called the satisfaction of rules, and the second part will be the evaluation of rules. We first define the rule evaluation and then discuss the satisfaction problem for the bodies and constraints.
Given a property graph with time series , a query , expressed in GQL-TS logic, consists of a finite set of rules , where a rule is a GQL-TS logic rule and where at least one rule in has as its head. We denote the predicate of the head of rule by head.
When a rule has head predicate , and only when the semantics of all predicates that are successors of in the dependency graph of have been evaluated on , we can define the semantics of the predicate on . We remark that this recursive process stops since the dependency graph of is acyclic.
The evaluation of a rule on results in the creation of a -ary relation, where is the arity of . This -ary relation is defined in terms of assignments of the form which assign identifiers to node, edge, and measurement variables. Suppose that rule has the form ; then we define to be the set of assignments for which in the usual sense used in predicate logic (for our specific constraints, in particular those on time series, we give the details below). If the head of has the form , then the evaluation of the rule on is the set The head of rule may appear in several rules of the query (possibly with other variable names). Then, we define the semantics of this predicate in to be The result of the evaluation of the query on the property graph with time series is then defined to be the -ary relation What remains to be specified is the meaning of and for a in a rule (as in Definition 4.2) and in a rule (as in Definition 4.3).
Satisfaction of body predicates. We define the meaning of for each of the various forms of that we defined earlier in Definition 4.2.
A body predicate in a rule is satisfied by a mapping .
The following applies.
-
If is of the form and , then if an edge exists in such that and .
-
If is of the form and , then if an edge exists in with .
-
If is of the form , an additional set is needed before we can define the satisfaction condition. For this purpose, we introduce the following notation: which expresses that the tuple belongs to (1) the transitive closure of the edges in with label when is an existing label in or (2) the transitive closure of a newly created relation, labelled , when is not in the original graph . The set is then defined as We want to remark that the meaning of this set has to be interpreted as “zero or more times an edge with label ”. Therefore, the pairs are included to ensure that assignments where the edge occurs zero times are possible. Finally, if .
-
If is of the form with , then if and .
Satisfaction of constraints. We define the meaning of , for each form of that appears in Definition 4.3. For each of the comparison relations and , we also use this symbol for their interpretation in the structure (abusing notation).
-
If is of the form , then if and .
-
If is of the form , then if and .
-
If is of the form , then if .
For the constraints concerning the time series, two additional functions are assumed to exist. We assume a function exists, with co-domain in , that takes as input value and time constants as well as variables. We denote this function with . Similarly, the function taking value and time constants as well as variables with co-domain is represented with .
-
If the form of is , then if
1.
,
- 2.
, and
- 3.
for all values with and holds.
-
If the form of is , with and , then if
- 1.
,
- 2.
, and
- 3.
for all values with and holds, except for , and then holds.
- 1.
The following applies.
-
If has the form and is of the form and , then if and .
-
If has the form and is the result of with and , then if and .
-
If has the form and and are value constants, the result of applied to time constants, or the result of after is applied to variables, then if .
The following applies.
-
If has the form is and is of the form and , then if and .
-
If has the form is and is the result of with , then if and .
-
If has the form is and and are value constants, the result of applied to value constants, or the result of after is applied to variables, then if .
Example 4.2. In order to have a more practical understanding of the evaluation of the rules, the two queries introduced in Example 4.1 are evaluated here.
The first query contains one rule with head
-
: none.
-
: , ; , .
-
: none because there is already no mapping for .
-
: , ; , .
-
: none.
-
: none.
-
,
-
,
-
, or
-
.
Only the mappings for the variables and the final result are discussed for the second query. In theory, all possible mappings of edges need to be considered for and . However, the constraint only allows consecutive edges. In addition, the constraint with and quickly reduces the possible number of mappings to only one: and . As last time, all possible measurements need to be matched, considering the constraints and . Only one mapping with and is able to do this. Therefore, the result of this query contains only one entry: . We remark that, if other measurements in these temporal properties yielded a valid mapping, the result would remain the same since the union would eliminate the duplicate entry for and . If the measurements themselves were included, then this would not happen.
4.2 Graph query language with time seriesIn this section, the GPML notation is extended with elements to describe time series with time series patterns consisting of measurement patterns. Important is that, in a time series pattern, two consecutive measurement patterns indicate that the measurements matched need to fulfil the relationship. We start with an introduction of the existing GPML notation of queries. These express graph patterns matching nodes and edges. The ability to express temporal properties in these nodes and edges is added by us in the subsequent subsection. First, the syntax of a measurement pattern is described. Afterwards, we demonstrate how these can be combined into a time series pattern, and finally, these time series patterns are linked with existing GPML notation. When all these elements are described, then we will discuss a set of example queries that demonstrate the use of the newly developed notation.
4.2.1 GPML introduction
To recap, we will first show some important notations of GPML. We refer to for a complete description of the notation. A pattern in GPML describes the structure to which sub-graphs need to adhere to be considered a match. Additional information, such as properties and labels, involves constraints added to the pattern.
A node is described by a set of round brackets, a possible name, and a label within the brackets separated by a colon. Property constraints can be expressed after the
An edge is represented by two square brackets and an arrow. The pattern can contain a label, an edge type, a variable name, and properties. The notation for these three elements is exactly the same as for node patterns.
Finally, a query consists of node and edge patterns, preceded by the keyword
4.2.2 Measurement pattern
We consider a measurement a timestamp–value pair. A measurement pattern
4.2.3 Time series pattern
A time series consists of one or multiple measurements with an order on their timestamp. Therefore, a time series pattern is one or multiple measurement patterns that describe a subset of the time series together. Two concatenated measurement patterns express two measurements for which the predicate holds. This means that for the pattern
In GPML, some quantifiers allow patterns to be repeated. The Kleene star is the most generic but is accompanied by three additional quantifiers. These quantifiers can be used in the time series patterns. However, the meaning is slightly different. Between two measurements,
4.2.4 Time series patterns as temporal properties
Finally, all we have to show is how the time series patterns are linked to node and edge patterns. The same notation as for the static properties is used to link the time series patterns to a graph pattern. To express a temporal property of a node or edge, a time series pattern needs to be added to the node or edge properties, together with the name of the temporal property. This is expressed by writing, after the
4.2.5 Examples
The following are queries for the running example, given in Example 3.1 and shown in Fig. .
Example 4.3. In this example, the query's result is the temperature at 14:00 in a node with the name
There is only one match in the example graph – that is, node
Example 4.4. In this second example, two different nodes are matched, requiring both to have a temporal property
In the example graph, this query will produce eight matches: one where
We remark that there are 16 matches. The same match can be made for each match given with the nodes reversed for
Example 4.5. The following example focuses on variable length paths and the variable-length time series patterns.
The temporal properties are treated at the same level as static properties, nodes, and edges.
We refer to Sect. that shows an equivalence between temporal properties and nodes and edges in the graph.
This means the dependency between temporal and static properties can be expressed within one query. The following example depicts a situation where node or edge properties determine the time a temporal property needs to be constrained.Example 4.6. In this example, the query matches measurements based on values that occurred earlier in the path we want to match. The time that the
There is more to be said about the equality of time stamps in time series. One cannot easily assume that the exact same timestamp is present in two different time series. However, for now, this is assumed.
Compared to procedures implemented in a query language or methods implemented using drivers for the databases, the method presented here gives more possibilities for writing constraints. In addition, early pruning of the graph or time series matching is now possible. If they are treated separately, the graph part needs to be retrieved first, and subsequently, the time series need to be queried, or vice versa. This does not provide the same flexibility as our method.
5 Implementation
To demonstrate the idea in practice, we developed a database system that can store graphs as defined in Definition 3.1 and query it with our proposed query language. Based on the additions described in Sect. , an existing query language is extended to facilitate queries considering the new temporal properties. We chose to realise this with the Neo4j database. It is well-established in practice and provides a good foundation. The open-source query parser from OpenCypher, used in Neo4j, provides a suitable starting place for the query language implementation. Adding other functionalities to Neo4j is possible by using user-defined functions and procedures.
5.1 Storing in practice
There are different options to store the actual time series in the graph database. The storage approach is indifferent of the model presented earlier but usually supports either efficient storage or easier querying. In this work, we use the concept of a full-graph-based approach. That is, to store the time series in the graph database, we use the concept of the full graph model. This means the graph database is used to store the network's topology and time series. The latter are represented as linked lists of nodes in the graph. For each measurement, a node containing the timestamp and value of the measurement is created. These measurements can be linked by edges, connecting all measurements in the temporal order. This means that for each time series, there is a linked list of nodes where each node is a measurement. A head time series node is added to these lists when the time series is linked to a node (of the spatial part). This way, the node of the topology and the head nodes can be linked by a specially typed edge. For edges, this is not possible. There, a similar link can be established by adding a unique ID in the head node of the time series and the static properties of the edge. This would also be possible for nodes, leading to a more consistent design. But we determined that it would impact the query performance, although future research has to verify this. We chose to link the oldest measurement as the first measurement to the head node. This means the latest measurement is stored at the end of the linked list. We selected this approach because, this way, the order of measurements described in the queries corresponds to the order in which they are matched. It could be interesting to store the most recent measurement first because users might be more interested in recent measurements. Of course, the full graph model has a noticeable impact on the total number of nodes and edges stored, as we will show. However, practical experiments have not yet shown any limitations.
5.2 Querying in practice
A new query parser was implemented using Antlr (
Our grammar for the query language is based on the G4 Cypher grammar. The original grammar specification can be found on the OpenCypher Git repository.
It was deleted on 17 October 2022 (commit 38b5e39). We use the version available on the repository on 27 July 2022.
The grammar describes valid measurement and time series patterns by adding the following rules.We first define the grammar describing a measurement itself. Therefore, a
The next three elements are all grammar rules we need to describe a series pattern including the name and measurement patterns.
To link these series patterns to the existing property patterns, only the rule expressing the mapping for properties needs to be updated to include the time series pattern.
5.3 Experiment
This section focuses on implementing the proposed theory and testing its feasibility. An experiment was conducted to study water quality measurements in the Internet of Water project as a proof of concept. In this section, the tests are designed to determine whether domain experts can express their questions in the newly designed system. A benchmark or performance study is not yet included, but we do report some performance statistics to give some context.
In Flanders, the Internet of Water (IoW) project (
In this set-up, we study the Yser river west of Flanders, depicted in Fig. . It is a region prone to salt intrusion because it is close to the sea. In dry periods, salty seawater will push land inwards along the river and intrude via groundwater . For this reason, 42 electrical conductivity (EC) and water temperature (WT) sensors have been placed on the river, and we use the resulting measurements to pose typical questions.
Figure 2
Overview of the Yser river in the west of Belgium. Only a subset of the stations is shown. There are more stations in the data set, but they are not drawn to reduce clutter.
[Figure omitted. See PDF]
To model the river, we use the Vlaamse Hydrographic Atlas (VHA), a digital GIS data set representing all rivers in Flanders. In this data set, the rivers consist of smaller river segments, represented as line geometries. Sensors are attributed to segments, and because the researchers are interested in studying the river per segment, we chose to model each segment as a node in the graph. If water flows from one segment to another, the two nodes, representing those two segments, are connected by an edge indicating the water flow. In total, there are 534 segments in this use case. The graph thus contains 534 nodes, and they are connected by 541 edges. Distributed over this network are 42 sensors, each producing EC25 measurements every 15 min.
Disk usage of the Neo4j database for the Yser data.
| Node storage | Edge storage | Property storage | Total storage |
|---|---|---|---|
| 5.93 MB | 13.49 MB | 32.53 MB | 53.04 MB |
Before we discuss the queries in detail, some additional information about the use case is needed. Most of the sensors in the Internet of Water are numbered and named: for example, IOW1 for Internet of Water sensor number 1. Some sensors have slightly different names because they already existed before the project. This name can be used to identify the series of the sensor in the graph uniquely. Our implementation allows users to add labels to the series, as is possible for nodes and edges. For this use case, all series are EC25 measurements, so we assign the label “ec25” to each series. To identify a series by name or label, the following notation is used:
Query 1. Every segment is uniquely identified with the vhas number (the ID of the segment). In this first query, we are going to look up the location of a sensor by querying for the vhas of the segment. The
Query 2. The second example query is a typical question where the value of a specific moment and location is requested. In this case, the value of the IOW1 sensor at midnight on 30 March.
Query 3. With this example, we show how a label can access a series with a specific type. The names or locations might be unknown, but the series type is known. Here, we want to access EC measurements and see their values on 30 March at midnight. The results are shown in Table , where, next to the value, the vhas of the segment is returned where the series is on. In total 36 records are returned, but we truncated the table for readability reasons.
Result table containing the first 10 records from the answer to Query 3.
| n.vhas | a.value |
|---|---|
| 6042697 | 7441.90 |
| 6018926 | 2899.00 |
| 6033571 | 777.16 |
| 6021226 | 3705.00 |
| 6021185 | 1470.00 |
| 6033487 | 751.89 |
| 7073733 | 2388.43 |
| 6033467 | 10362.00 |
| 6033467 | 7842.00 |
| 6033615 | 8597.00 |
| … | … |
Query 4. This query demonstrates a more advanced series pattern. Because consecutive measurement patterns match consecutive measurements in the series, it is possible to describe a peak. A peak is described as three measurements
Five first records retrieved by Query 4 showing five peak values in sensor IOW1.
| b.timestamp | b.value |
|---|---|
| 2022-01-01T01:00:00Z | 536.72 |
| 2022-01-01T02:00:00Z | 536.17 |
| 2022-01-01T06:30:00Z | 557.84 |
| 2022-01-01T07:30:00Z | 557.15 |
| 2022-01-01T11:15:00Z | 550.97 |
Query 5. This query demonstrates how surpassing a threshold in a series can be identified. Such a query interests scientists who, for example, want to find river pollution spills. They know that the EC value would pass a threshold in the event of a spill, but they do not know if and where this happens. The following query matches measurements passing a threshold and returns where and when this occurred.
Query 6. The values of the measurements can be used to derive analytical results. In this case, the query retrieves a series at a specific location and calculates a moving average window. Every possible pattern match is returned as a record, and each record selects five consecutive measurements.
Records showing the average moving window returned by Query 6.
| a.timestamp | (a.value + b.value + c.value + |
|---|---|
| d.value + e.value) 5 | |
| 2022-01-01T00:00:00Z | 537.16 |
| 2022-01-01T00:15:00Z | 536.79 |
| 2022-01-01T00:30:00Z | 536.53 |
| 2022-01-01T00:45:00Z | 536.31 |
| 2022-01-01T01:00:00Z | 536.20 |
| 2022-01-01T01:15:00Z | 536.09 |
| 2022-01-01T01:30:00Z | 536.10 |
| 2022-01-01T01:45:00Z | 536.35 |
| 2022-01-01T02:00:00Z | 536.69 |
| 2022-01-01T02:15:00Z | 537.12 |
| 2022-01-01T02:30:00Z | 537.57 |
| 2022-01-01T02:45:00Z | 538.22 |
| 2022-01-01T03:00:00Z | 539.80 |
| 2022-01-01T03:15:00Z | 541.69 |
| 2022-01-01T03:30:00Z | 543.76 |
| 2022-01-01T03:45:00Z | 546.06 |
Query 7. Using more advanced graph patterns is useful to describe network structures where the location might be unknown. As this query shows, it looks for a path between two given locations in the network and retrieves the value measured on 30 March at midnight along possible paths. Specifically, the query looks for each ec25-labelled series on a possible path between segment 6031906 and segment 6033656.
Query 8. In the IoW project, the water team of VITO established different use cases and models to analyse the Yser river and its water quality. One of the results is the IGOR tool , which is used to analyse electrical conductivity on the Yser river. IGOR provides the ability to interpolate the sensor's measurements for any given location on the river. This interpolation query is the final analysis step to demonstrate and test the database in this context. In the graph, it should be possible to do the same interpolation for any location. Locations in the graphs are considered to be the nodes, and the nodes correspond to segments. This would mean the interpolation can be done for each segment, with or without a sensor. This matches the closest sensor upstream and downstream of the specified location. Subsequently, it interpolates the values for all the provided points in time linearly to the distance between the sensors and the location. This means that, for a given location, the entire time series is interpolated as long as both sensors have a value for the same moment in time.
We acknowledge that it is possible to use existing temporal graphs to model and query transportation networks. There is work from that models property graphs with categorical properties through time. Here, we explicitly want to model time series, possibly with a high resolution, and we want to emphasise that the network is stable, whereas the properties can take on many values through time. In addition, our approach exploits the fact that time series have a natural order and uses this in graph pattern expressions. We discussed the advantages of our query language compared to the procedures defined in addition to the existing query language. Nonetheless, procedures might need to be written because not everything is expressible in the proposed query language. That is, expressing constraints on each pair of nodes in a path where the path length is unbounded or expressing constraints on each pair of measurements where the number of measurements is unbounded is impossible.
The usage of labels in the experiment is a useful feature that was possible because the full graph model is used in the implementation. The implementation allows the labels of nodes to be used as labels for the time series. However, this use of labels for time series properties is not fully explored and defined. This should be realised in future steps to standardise the working, independent of the implementation method.
For each topology pattern, multiple measurement patterns can be valid in a time series. The number of matches in the end result is the sum of all valid patterns in the time series for each valid spatial pattern. At most, each measurement is a valid match for a time series pattern, and then the number of topological matches needs to be multiplied by the number of measurements in the relevant time series. This can impact the efficiency. Based on the model and logic presented here, this can now be studied more formally.
At the time of writing, the system is being demonstrated to and evaluated by expert researchers in hydrology and energy fields to evaluate the performance further and identify additional functionality. Therefore, the experiment shown should be seen as a proof of concept. A performance analysis or benchmark is not conducted because the current implementation is not mature enough, and such a study would lead to false results. For example, the current translated queries rely on the transitive closure computation (the “*” operator) in Neo4j. However, earlier research has shown that this operator cannot handle path queries with unbound length as efficiently as other functions . However, we would like to indicate the usability based on this proof of concept. The database Neo4j (version 4.2.3) was deployed on a server with Ubuntu 20.04, a two-core system (Intel Xeon Gold 6136 CPU at 3.00 GHz) with 16 GB of memory. Queries 1 to 7 ran and retrieved results between 399 ms and 299 s (299 358 ms), including transferring the records over the network to the local computer. Also, 1 min query times occurred for queries 3, 7, and 8. The translation of the queries from our language to full Cypher was performed between 5 and 382 ms. Again, we want to stress that all these results merely indicate the current state. The implementation is focused on functionality and not performance-oriented, nor are the timings conducted repeatedly. An objective study needs to be conducted to validate the performance of an implementation that exhibits a higher level of technological readiness.
7 Conclusions
Transportation networks are a recurrent research topic and can be modelled using property graphs. With the rise of the Internet of Things, measuring the status of the networks has become easier and the data volume has increased in terms of the number of sensors and resolution of the measurements. These sensors produce time series that, ideally, should be included in the property graphs. Temporal graphs focus on nodes and edges being valid at a certain time. In contrast, transportation networks rarely change their topology, but the measurements are properties of nodes or edges, where the values change through time. Graph databases also exist for storing property graphs, as do time series databases to store the measurements, but a combination, exploiting the characteristics, is missing. In this work, a property graph model with time series is proposed to provide a basic model. This model considers time series on nodes and edges together with the traditional properties as first-class citizens. The model is accompanied by a query language logic, exploiting the natural order in time of the measurements, to describe graph patterns that can be matched. In addition, the paper shows how this logic can be realised in graph query languages based on the Graph Pattern Matching Language (GPML). This all together creates property graphs with time series and leads to a tool set for querying transportation networks or transforming them for more advanced processing. For the first time, it is possible to express patterns where there are constraints that take into account nodes, edges, properties, and measurements of time series – that is, timestamp and value pairs – at the same time. These can all depend on each other, providing the possibility to search and evaluate graphs on all these constraints at the same time. There is more work needed to formally study these graphs with respect to complexity and evaluation, as the increased data size might impact the query evaluation. However, there are also possibilities to exploit optimisations. For example, the new constraints can prune path searches earlier, or path searches can be limit the number of time series that need to be processed. The next step is to realise all this in practice by implementing the property graphs with time series in an actual database and providing the proposed query language together with a query engine that processes the queries. With this theory, a first step is set to store, process, and then also analyse transportation networks based on property graphs with time series. This, in turn, can support research concerned with these networks such as river networks, electricity networks, and heat networks, amongst others.
Data availability
The sensor data used in this paper are owned by the Flemish Environmental Agency and made available at
Author contributions
All authors contributed equally to this paper's conceptualisation and writing (review and editing). Erik Bollen did the investigation, software visualisation, and writing (original draft preparation). All of this is under the supervision of Rik Hendrix and Bart Kuijpers. All authors have read and agreed to the published version of the manuscript.
Competing interests
The contact author has declared that none of the authors has any competing interests.
Disclaimer
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
Acknowledgements
The authors would like to thank Alejandro Vaisman and Valeria Soliani for fruitful discussions on property graphs with time series data.
Review statement
This paper was edited by Fernando Nardi and reviewed by Maxim V. Philippov and one anonymous referee.
© 2024. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.