Learning Spatial-Temporal Features of

Full text

Turn on search term navigation

This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

1. Introduction

A taxi is one of the key public passenger transports for daily travel activities in urban cities. Recently, ride-hailing service platforms, such as Didi, Uber, and Lyft, have grown tremendously, which provide convenient ride services by connecting passengers and drivers via smartphone apps. Ride-hailing services have various advantages compared to taxis but still suffer from the spatial and temporal imbalance between the supply and demand [1–3]. To provide more convenient and efficient services, ride-hailing demand forecasting is of great importance. Accurate ride-hailing demand prediction can help enhance user experience, increase taxi utilization, and optimize traffic efficiency.

In recent years, ride-hailing demand prediction has attracted a wide range of research interests. To realize more accurate and reliable demand predictions, both traditional time-series and machine learning models are thoroughly studied by researchers [4–12]. Among these works, different deep learning architectures have been developed to model the complex spatial and temporal dependencies among regions. Specifically, the convolutional neural network (CNN) and graph neural network (GNN)-based modules are used to capture spatial dependencies, while the recurrent and recursive network (e.g., RNN and LSTM)-based modules are used to learn temporal dependencies [13–15].

Most of the existing studies focus on how to model the spatial and temporal properties of the demand data but ignore the local statistical differences among regions. However, due to the local statistics varying from region to region, the parameter-sharing scheme at the spatial level may not be valid. For example, radial cities often have a typical central structure, and it is inappropriate to use the same parameters to predict the demands for both the central and peripheral areas of the city.

On the other hand, the models that process the temporal information by RNNs (or LSTMs) may not be able to accurately model the connections between spatial and temporal dynamics, and the training of the recurrent and recursive networks is time consuming. In addition to the recurrent architectures, experiments in many studies indicate that convolutional operations can be even more suitable for sequence modeling after appropriate modifications [7, 16–19].

In this paper, we design a fully convolutional architecture named LC-ST-FCN (locally connected spatial-temporal fully convolutional neural network) to address the issues mentioned above. Specifically, in the LC-ST-FCN model, the 3D convolutional operations are used to extract features from the spatial and temporal dimensions simultaneously. The locally connected convolutional layers are adopted to deal with the local statistical differences among regions. Due to the fully convolutional architecture of the LC-ST-FCN, the spatial coordinates of regions are maintained throughout the process, and no spatial information is lost between layers. Hence, the prediction error of each region can be delivered to the previous layers independently during the backpropagation training stage and optimize the corresponding model parameters of the region.

We evaluate the proposed model on a real dataset from a ride-hailing service platform (DiDi Chuxing) and observe significant improvements compared with a bunch of baseline models. Besides the predictive performance, we visualize the feature maps produced by the LC-ST-FCN model to investigate the working mechanism of the proposed model. The analysis results illustrate that (1) the fully convolutional architecture can accurately localize the most related regions for each target region, (2) the spatiotemporal features extracted by 3D convolutional layers are more powerful and suitable for the subsequent layers to learn, and (3) the locally connected layers can deal with the issue of local statistical differences.

The remainder of the paper is organized as follows: Section 2 reviews the existing literature. Section 3 describes the LC-ST-FCN model in detail. Section 4 compares the predictive performance between the proposed approach and the baseline models based on the real-world dataset extracted from DiDi Chuxing. Section 5 analyzes the spatial correlations among regions and the effectiveness of model structures. Section 6 concludes the study and proposes future research directions.

2. Related Work

With the rapid development of electronic sensors and wireless communication technology, real-time demand prediction using the spatial and temporal data collected by the Internet and mobile terminals has become a research hotspot in the field of transportation. For the consideration of the temporal dependencies among the traffic data in various time intervals, a series of time-series [4, 5, 11, 20] and machine learning [21, 22] methods are adopted to solve traffic prediction problems. However, these time-series methods are not capable of handling spatial dependencies, which limits their performance.

In recent years, deep learning methods are increasingly being applied to transportation research and significantly improve the accuracy of many traffic prediction tasks, including traffic flow prediction, traffic congestion forecasting, and traffic demand prediction [23–25].

To achieve accurate region-based forecasting, a number of deep learning models were designed to model complex spatiotemporal information, and state-of-the-art results were achieved [6–10, 12]. The research in [19] studied the predictive performance of deep neural networks (DNNs) on taxi demand forecasting problems, and their experimental results show that DNNs indeed outperform most traditional machine learning techniques, but such superior results can only be achieved with proper design of the right DNN architecture, where domain knowledge plays a key role. Recently, graph-based models, especially spectral-based graph convolutional networks, have attracted a lot of attention due to the effective representation of the graph structure data [26–29]. However, when it comes to the topic of taxi demand forecasting, graph-based models heavily rely on the graph structures generated based on artificial experience. Most of the previous studies have focused on the combinations of CNNs and RNNs, where RNN architectures were utilized to model temporal dependencies. The experimental results in [18] indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets.

Different from the abovementioned previous studies, 3D convolutional operations are employed to simultaneously capture the spatial-temporal dependencies in this paper. 3D convolution-related models have achieved superior performance on many types of video analysis tasks, such as action, scene, and object recognition [17, 30, 31]; since for video analysis tasks, it is very important to effectively extract the spatiotemporal features from adjacent video frames. Inspired by the findings on face verification [32, 33], in our work, locally connected convolutional layers are used at the end of the model to obtain the final prediction results, without parameter sharing.

3. LC-ST-FCN Model

In this study, we partition a city into an $I \times J$ grid map based on the longitude and latitude, where a grid cell denotes a region. We divide the observation time period (e.g. one month) into $T$ time intervals with the interval length $Δ t$ (e.g., 10 minutes). At the time interval $t$ , the order demand generated in the region $i, j$ is $x_{t}^{i, j}$ , and the demand matrix of the entire $I \times J$ region is $X_{t} \in ℝ^{I \times J}$ .

Our goal is to predict how many orders will emerge during the future period $t * Δ t, t + 1 * Δ t$ in each region. That is, given the historical observations $X_{0}, X_{1}, \dots, X_{t - 1}$ , $X_{t}$ is predicted. The LC-ST-FCN model shown in Figure 1 is developed to achieve this goal. Each input sample of our model is generated from historical observations and consists of two parts, the recent dependent sequence and the period-dependent sequence. We concatenate them together along the time dimension as one tensor $X_{t}^{in}$ . 3D convolutional operations are used to fuse the spatial and temporal information into multiple time intervals. Since 3D convolution is a natural extension of 2D convolution, features from both spatial and temporal dimensions can be learned simultaneously. Then, multiple 2D convolutional layers are used to extract and encode features from low to high level, and from local to global. With the increase in network depth, the receptive field of neurons also increases, allowing an increasing amount of spatial information to be learned. In particular, locally connected convolutional layers are used at the end of the model to obtain the final prediction results, without parameter sharing.

[figure(s) omitted; refer to PDF]

According to the findings in 2D CNNs [34], small receptive fields of 3 × 3 convolutional kernels with deeper architectures yield the best results. Hence, in our LC-ST-FCN model, we fix the spatial receptive field to $3 \times 3$ and vary only the temporal depth of 3D convolutional filters. The 3D convolutional filters have access to information across all input demand matrices after several layers, depending on the depth of the input data. The important components and training methods of the LC-ST-FCN model are introduced in the following sections.

3.1. Input: 3D Volume

As a class of attractive deep models for automated spatial feature construction, CNNs have been primarily applied to 2D images. However, for short-term passenger demand forecasting, it is also important to capture the temporal information in multiple adjacent or periodic time intervals. For instance, the traffic demand is expected to be high during peak hours and to be low during the off-peak hours of the day, and the traffic demand patterns of weekdays are very different from those of the weekend. Therefore, we construct the following input data forms, which integrate the spatial-temporal information of multiple time intervals into one 3D volume, with one demand matrix for each time interval: $\begin{matrix} (1) & X_{t}^{in} = X_{t}^{rec}; X_{t}^{per}, X_{t}^{in} \in ℝ^{I \times J \times 2 τ}, \\ X_{t}^{rec} = X_{t}, X_{t - 1}, \dots, X_{t - τ + 1}, X_{t}^{rec} \in ℝ^{I \times J \times τ}, \\ X_{t}^{per} = X_{t - l + 1}, X_{t - l}, \dots, X_{t - l - τ}, X_{t}^{per} \in ℝ^{I \times J \times τ}, \end{matrix}$ where $τ$ is the number of time intervals in $X_{t}^{rec}$ and $X_{t}^{per}$ , $X_{t}^{rec}$ represents the recent term, $X_{t}^{per}$ represents the period term, and $l$ denotes the number of time intervals between $X_{t}^{rec}$ and $X_{t}^{per}$ . In urban cities, demand patterns always show the characteristics of periodicity. The reasons for the change in demand patterns on consecutive workdays can be very complicated. For example, the tail number limited policy has been implemented in many cities in China, and people are more likely to take a taxi when their cars are restricted. Thus, we use both current and historical data to improve the predictive performance of the model. As shown in Figure 2, each input instance used by the LC-ST-FCN model is a 3D volume containing $2 τ$ time intervals, with each time interval corresponding to a matrix of $I \times J$ grid values.

[figure(s) omitted; refer to PDF]

In this paper, the additive model [35] is used to explore periodicity, and we find that the demand pattern of a day is very similar to that of the same day a week ago. By decomposing the demand data with $l = 144$ (e.g., number of time intervals in a day) and $l = 1008$ (e.g., number of time intervals in a week), it can be observed that, when $l = 1008$ , the trend term is almost linear and the variance of the residual term is much smaller than $l = 144$ (as shown in Figure 3). Hence, we select $l = 1008$ to construct our training 3D tensor for training.

[figure(s) omitted; refer to PDF]

3.2. The Fusion of 3D and 2D Convolutions

In 2D CNNs, convolutions are applied to the 2D feature maps to compute features from spatial dimensions only. When the 2D CNN is applied to the demand forecasting problem, it is desirable to capture the temporal information in multiple time intervals, e.g., trends and periods. Although 2D CNNs can also take multiple time intervals as the input, after the first convolution layer, temporal information is collapsed completely. Hence, it is difficult to effectively extract temporal information by 2D convolutional operations. Compared to 2D convolution, 3D convolution has the ability to model the temporal information better, due to that 3D convolution preserves the temporal information of the input by maintaining the 3D volume as the output.

In this section, we first introduce the implementation of 2D convolutional operations and further define the calculation of 3D convolutional operations to show their differences. For the sake of simplicity and flexibility, we denote the input of these two operations as $X_{in} \in ℝ^{I \times J \times 2 τ}$ . In the 2D convolutional layer, a group of filters ${ℱ_{2 D}^{m}}_{m = 1}^{M}$ with size $ℝ^{H \times H \times 2 τ}$ and stride 1 are used to extract local features from the input 3D volume: $\begin{matrix} (2) & S_{2 D}^{t, m} = σ W_{2 D}^{m} * X_{in}^{t} + b_{2 D}^{m}, S_{2 D}^{t, m} \in ℝ^{I \times J \times 1}, \\ S_{2 D}^{t} = S_{2 D}^{t, 1}, \dots, S_{2 D}^{t, m}, \dots, S_{2 D}^{t, M}, S_{2 D}^{t} \in ℝ^{I \times J \times M}, \end{matrix}$ where “ $*$ ” denotes the convolutional operation, performing element-wise multiplications between $W_{2 D}^{m}$ and $X_{in}^{t}$ at any position when sliding the filter $ℱ_{2 D}^{m}$ across the width and height dimensions, $σ$ represents a nonlinear activation function (e.g., ReLU), $W_{2 D}^{m}$ is the weight matrix of the filter $ℱ_{2 D}^{m}$ with $b_{2 D}^{m}$ as the bias term, and $M$ is the number of filters in each group. For each filter $ℱ_{2 D}^{m}$ , after convolutional operations, its output $S_{2 D}^{t, m}$ will compress the temporal dimension of $X_{in}^{t}$ from $ℝ^{M}$ to $ℝ^{1}$ .

On the other hand, in the 3D convolutional layer, a group of filters ${ℱ_{3 D}^{m}}_{m = 1}^{M}$ with size $ℝ^{H \times H \times F}$ and stride 1 are also used to extract local features from the input 3D volume. Specifically, the depth size $F$ of each $ℱ_{3 D}^{m}$ is less than $2 τ$ , which enables the filters in the 3D convolutional layer to perform convolutional operations along the temporal dimension. Thus, the local temporal dependencies of the input can be extracted. $\begin{matrix} (3) & X_{in}^{t} = X_{in}^{t, 1}, \dots, X_{in}^{t, p}, \dots, X_{in}^{t, 2 τ - F + 1}, X_{in}^{t, p} \in ℝ^{I \times J \times F}, \\ S_{3 D}^{t, p, m} = σ W_{3 D}^{m} * X_{in}^{t, p} + b_{3 D}^{m}, S_{3 D}^{t, p, m} \in ℝ^{I \times J \times 1}, \\ S_{3 D}^{t} = \begin{matrix} S_{3 D}^{t, 1,1} & \dots & S_{3 D}^{t, 1, M} \\ ⋮ & ⋱ & ⋮ \\ S_{3 D}^{t, 2 τ - F + 1,1} & \dots & S_{3 D}^{t, 2 τ - F + 1, M} \end{matrix}, \end{matrix}$ where $W_{3 D}^{m}$ is the weight matrix of the filter $ℱ_{3 D}^{m}$ with $b_{3 D}^{m}$ as the bias term and $X_{in}^{t, p}$ is a data patch obtained by sliding over the temporal dimension. Given the temporal size of $X_{in}^{t}$ is $2 τ$ , $2 τ - F + 1$ patches can be produced. $S_{3 D}^{t, p, m}$ denotes the output of $ℱ_{3 D}^{m}$ on one patch $p$ , and the output of $ℱ_{3 D}^{m}$ on all the patches is $S_{3 D}^{t} \in ℝ^{I \times J \times 2 τ - F + 1}$ . Since the outputs produced by the 3D convolutional operations are 3-dimensional feature maps rather than 2-dimensional feature maps, it is more suitable to extract the temporal dependencies among the input. By stacking multiple 3D convolutional layers together, they can extract spatial and temporal features from local to global.

3.3. Locally Connected Convolutional Layer

The setting of parameter sharing in 2D convolutional layers assumes local features. For example, if a horizontal boundary in some parts of the image is regarded as important, then it will be equally useful in other places. However, the parameter sharing assumption may not be applicable to demand forecasting problems as local features vary from region to region. For example, radial cities often have a typical central structure, and it is inappropriate to use the same parameters to predict the demands for both the central and peripheral areas of the city.

In the LC-ST-FCN model, we relax the assumption of parameter sharing in 2D convolution layers by using the locally connected layers at the end of the model. In this way, the prediction value of each region is obtained by a different set of parameters, and the prediction error of each region can be delivered to the previous layers independently during the backpropagation training process. As shown in Figure 4(b), in locally connected convolutional layers, the filters in different spatial locations have different parameters. Compared with fully connected layers (Figure 4(a)) or 2D convolutional layers, locally connected convolution layers can simultaneously maintain local statistics and spatial coordinates.

[figure(s) omitted; refer to PDF]

3.4. Algorithm and Optimization

The LC-ST-FCN model can be trained by minimizing the mean squared error between the estimated demand ${\hat{Y}}_{t}$ and the real demand $Y_{t}$ . The objective function is shown in Equation (4), where $w$ and $b$ are both learnable parameters. $\begin{matrix} (4) & \min_{W, b} Y - {\hat{Y}}_{2}^{2} . \end{matrix}$

Algorithm 1 outlines the LC-ST-FCN training process. We first construct the training instance from the historical observations (lines 1–8), where $D$ is the set of training instances. Then, the LC-ST-FCN model is trained via backpropagation and the adaptive subgradient method (AdaGrad) (lines 9–28). Under the parameter-sharing scheme of 2D and 3D convolutional layers, the same $w_{m}$ and $b_{m}$ are used when we slide each filter $m$ across the width, height, and depth of each input instance in one minibatch (line 16). For each filter $m$ , in locally connected convolutional layers, different $w_{m}^{'}$ and $b_{m}^{'}$ will be used in different locations when we slide each filter across each input instance in one minibatch (line 22).

Algorithm 1: LC-ST-FCN training algorithm.

Input: Historical demand of each region: $X_{0}, X_{2}, \dots, X_{t - 1}$ ; lengths of input data sequence: $2 τ$ ; length of the period interval: $l$ .

Output: Learned LC-ST-FCN model.

(1) //construct training instance:

(2) $D \leftarrow \emptyset$

(3) for all the available time interval $t 1 \leq t \leq n$ do

(4) $X_{t}^{rec} = X_{t}, X_{t - 1}, \dots, X_{t - τ + 1}, X_{t}^{rec} \in ℝ^{I \times J \times τ}$

(5) $X_{t}^{per} = X_{t - l + 1}, X_{t - l}, \dots, X_{t - l - τ}, X_{t}^{p e r} \in ℝ^{I \times J \times τ}$

(6) // $X_{t}^{in}$ is the target at time $t$

(7) put a training instance $X_{t}^{in} = X_{t}^{rec}; X_{t}^{per}$ into $D$

(8) end for

(9) //Training:

(10) repeat

(11) Initialize the biases and weights at each layer;

(12) Sample minibatch from $D$ randomly;

(13) if in 2D or 3D convolution layers then

(14) for filters $m \in M$ do

(15) //Parameter sharing

(16) optimize learnable parameters $w_{m}$ and $b_{m}$

(17) end for

(18) end if

(19) if in locally connected convolution layers then

(20) for filters $m \in M$ do

(21) //Without parameter sharing

(22) optimize learnable parameters $w_{m}^{'}$ and $b_{m}^{'}$

(23) end for

(24) end if

(25) Calculate the stochastic gradient by minimizing the objective function (4);

(26) Update the parameters via backpropagation;

(27) until stopping criteria are met

4. Experiments

We perform experiments using a dataset from DiDi Chuxing (https://gaia.didichuxing.com), which is the largest ride-hailing service platform in China. The dataset includes customer requests in Chengdu, China, containing the request time, longitude, and latitude. After the raw data are cleaned, the dataset containing 7,031,022 requests within 103.85–104.30 longitude to the east and 30.48–30.87 latitude to the north is used in our experiments. All the requests are partitioned into 10-minute time intervals, and the investigated area is partitioned into $16 \times 16$ grids. Considering the periodicity of the data, we select the first three weeks of the dataset as a training set and the remaining nine days as an independent testing set. In the experiment, we set $τ = 10$ . Correspondingly, we use four 3D convolutional layers with a filter depth of $F = 3,5,7,8$ to extract the spatial-temporal information from inputs. The number of the 2D and locally connected convolutional layers is 4 and 2, respectively.

To evaluate the performance of the LC-ST-FCN model, we select a bunch of different models as benchmarks, where all the models are trained and validated with the same training set and test set. The baseline models are as follows:

(1) Additive Model. The time-series additive model is often used to decompose a time series into deterministic and nondeterministic components [35].

(2) Random Forest. The random forest is a powerful boosting tree-based method and is widely used in data mining applications. We train a random forest regressor with 200 trees (each limited to maximum of 20 nodes).

(3) ANN (Artificial Neural Network). We compare our method with a neural network of four fully connected layers. The numbers of hidden units are 128, 128, 64, and 64, respectively.

(4) CNN. The structure of the CNN is comprised of eight 2D convolutional layers and two fully connected layers with 256 outputs. The numbers of filters for convolutional layers from 1 to 8 is 32, 16, 16, 32, 32, 64, 64, and 32, respectively.

(5) LSTM. The LSTM is commonly adopted in sequence modeling due to its ability to capture both short- and long-term temporal dependencies. The LSTM model contains four LSTM layers with 128, 128, 64, and 64 hidden units each.

(6) Conv-LSTM. The Conv-LSTM structure can simultaneously learn spatial correlations and temporal dependencies [6]. The structure of the Conv-LSTM is comprised of four Conv-LSTM layers and two 2D convolutional layers.

(7) FCL-Net. FCL-Net [8] is one of the most powerful short-term passenger demand forecasting models. Conv-LSTM layers and convolutional operators are employed to capture the characteristics of spatiotemporal information. Each input consists of two terms, historical demand intensity and travel time rate, and is fed into two separate Conv-LSTM architectures.

(8) ST-ResNet. ST-ResNet [7] is a deep learning-based approach for traffic prediction. The CNN is used to extract features from historical images.

(9) LC-ST-FCN. The structure of LC-ST-FCN is comprised of four 3D convolutional layers, four 2D convolutional layers, and two locally connected convolutional layers.

(10) LC-LSTM-FCN. The combination of the convolutional layers and LSTM layers has been proved to be an effective structure in many tasks. The structure of LC-LSTM-FCN is comprised of four LSTM layers, four 2D convolutional layers, and two locally connected convolutional layers.

(11) LC-FCN. The structure of LC-FCN is comprised of eight 2D convolutional layers and two locally connected convolutional layers, where 2D convolutional layers are applied to the input to generate feature maps instead of 3D convolutional layers. Thus, the temporal dimension of input collapses after the first 2D convolutional layer.

(12) FCN. The structure of the FCN is comprised of ten 2D convolutional layers, where the 2D convolutional layer is used to obtain the final predictions.

We compare the proposed model against all the baseline models in terms of three widely adopted metrics, including the root mean squared error (RMSE), mean absolute percentage error (MAPE), and modified MAPE (SMAPE), which are defined as follows: $\begin{matrix} (5) & RMSE = \sqrt{\frac{1}{n} \sum_{i, j \in I, J} {y_{i, j} - {\hat{y}}_{i, j}}^{2}}, \\ MAPE = \frac{1}{n} \sum_{i, j \in I, J} \frac{y_{i, j} - {\hat{y}}_{i, j}}{y_{i, j} + ε}, \\ SMAPE = \frac{2}{n} \sum_{i, j \in I, J} \frac{y_{i, j} - {\hat{y}}_{i, j}}{y_{i, j} + {\hat{y}}_{i, j}}, \end{matrix}$ where $y_{i, j}$ and ${\hat{y}}_{i, j}$ are the ground truth and prediction value of demands in the region $i, j$ at the time interval $t$ , while $ε = 1$ is used to avoid the division by zero (otherwise, the MAPE equals to infinite when $y_{i, j} = 0$ ).

We train deep learning models, except ST-ResNet and FCL-Net, by implementing standard backpropagation on feedforward nets using the adaptive subgradient method (AdaGrad), and the initial learning rate is 0.01. The early stop is used, and the maximum number of training epochs is set to 100, while the batch size is set to 10. The training parameters of ST-ResNet and FCL-Net are consistent with [7, 8].

The best-performing model on the validation set was chosen for comparison and evaluation. In Table 1, the LC-ST-FCN model obtains the best results (i.e., the bolded values) on RMSE, MAPE, and SMAPE. Compared with ST-ResNet and FCL-Net, the LC-ST-FCN utilizes 3D convolutional layers to fuse the temporal components (e.g., short-term and periodic). In addition, the LC-ST-FCN considers the impact of local statistical differences which have been ignored in ST-ResNet and FCL-Net. To evaluate the effectiveness of the architecture of the proposed model, we compare it with other variants from three aspects. First, the LC-ST-FCN has better predictive performance than LC-FCN and LC-LSTM-FCN, which indicates that 3D convolutional operations capture the spatiotemporal information better than 2D convolutional operations and LSTM. Second, compared with the FCN, the predictive performance of LC-FCN is improved by simply replacing the last two 2 fully connected layers with locally connected convolutional layers. This implies that the local statistical differences among regions should be properly concerned. Moreover, both the LC-FCN and CNN are able to capture the local statistical differences among regions (the last two layers of them do not share parameters), but the LC-FCN performs better than the CNN. The reason may be that the fully convolutional architecture enables the model to focus on the neighboring regions of each target region by maintaining the spatial coordinates between the input and output.

Table 1

Comparison of the predictive performance.

Model	RMSE	MAPE (%)	SMAPE (%)
Additive model	1.86	24.78	29.92
Random forest	1.96	25.52	32.34
ANN	2.04	24.43	29.06
CNN	1.82	25.78	32.22
LSTM	1.82	26.95	35.76
Conv-LSTM	1.83	26.06	32.92
FCL-Net	1.77	25.05	32.48
ST-ResNet	1.75	24.81	31.30
FCN	1.78	27.36	33.85
LC-FCN	1.69	22.98	27.70
LC-LSTM-FCN	1.78	23.00	28.36
LC-ST-FCN	1.67	22.40	27.69

The bold values represent the best results given by the metrics (i.e., RMSE, MAPE, and SMAPE).

The validation error of the LC-ST-FCN and other deep learning models are recorded after each training epoch. In Figure 5(a), it can be observed that the training curve of the LC-ST-FCN drops quickly in the several initial epochs and then converges to a stable range, which indicates that the LC-ST-FCN model has a lower risk of overfitting. The training curves of the models with the convolutional LSTM architecture fluctuate markedly. On the other hand, the value of $τ$ is very critical. We perform the sensitivity analysis on these two parameters. We train the LC-ST-FCN model under different values of $τ$ with 100 training epochs. As shown in Figure 5(b), the RMSE of the LC-ST-FCN gradually decreases when the value of $τ$ increases, but there is no significant improvement when the value exceeds 10. Besides, the training time linearly increases with an increase in the value of $τ$ . Considering both the RMSE and computation time, we choose the value to be 10 in our experiments.

[figure(s) omitted; refer to PDF]

5. Analysis and Visualization

5.1. Spatial Correlations among Regions

The LC-ST-FCN model is developed to address the unique challenges of the passenger demand forecasting problem. Our model adopts the fully convolutional structure based on the assumption that learning the correlations among regions is of great importance. To validate the assumption, we calculate the pairwise Pearson correlation between regions by the following equation: $\begin{matrix} (6) & Corr Y_{i, j}, Y_{i^{'}, j^{'}} = \frac{E {Y_{i, j} - E Y_{i, j}}^{T} Y_{i^{'}, j^{'}} - E Y_{i^{'}, j^{'}}}{E {Y_{i, j} - E Y_{i, j}}^{2} E {Y_{i^{'}, j^{'}} - E Y_{i^{'}, j^{'}}}^{2}}, \end{matrix}$ where $Y_{i, j}$ and $Y_{i^{'}, j^{'}}$ are the demand time-series data of two regions. Figure 6(a) shows the Pearson correlations among all 256 regions, where the coordinates of each pixel point are obtained by computing $i * 16 + j$ (flatten the original grid regions) and the value of each pixel point represents the pairwise Pearson correlation coefficient. In Figure 6(a), we can see that the spatial correlation between each region and its surroundings does exist, but it has varying degrees. To further understand the relationship between the spatial correlation and the spatial distance, we calculate the grid distance ( $\sqrt{{i - i^{'}}^{2} + {j - j^{'}}^{2}}$ ) between regions shown in Figure 6(b). As shown in Figure 6(b), by and large, with an increase in the grid distance, the degree of correlation declines regularly. According to the correlation analysis results, we can conclude that there exist spatial correlations among regions, especially among neighboring regions. However, the Pearson correlation coefficient can only be used to measure the linear relationship between two variables, and the correlation among regions may be nonlinear. To better explain the spatial correlation among regions and reveal the working mechanism of the proposed LC-ST-FCN model, we perform a direct analysis of the spatial-temporal features and the activation maps produced by the LC-ST-FCN in the next section.

[figure(s) omitted; refer to PDF]

5.2. Evaluation of Model Structures

The experimental results show that the LC-ST-FCN model can realize accurate regional demand forecasting based on historical observations, which illustrates the effectiveness of the LC-ST-FCN. Besides the predictive performance, it is equally important to explore the reasons why the proposed model is more effective than other methods, especially other CNN-based models.

The motivations for building the model structure of the LC-ST-FCN are as follows.

First, for short-term passenger demand forecasting, detailed predictions in all regions of a city depend on both global and local features, which need to be encoded using spatial coordinates. Therefore, we adopt the fully convolutional architecture to maintain spatial coordinates.

Second, before 2D convolutional layers, using the 3D convolutional operations to fuse the spatial and temporal information in multiple time intervals can enable the network to learn more powerful features.

Thirdly, since the local statistics of regional demand data are not invariant, the spatial stationarity assumption of convolution cannot hold. Therefore, in the LC-ST-FCN model, the locally connected layers are utilized to obtain the final predictions.

To illustrate the effectiveness of the LC-ST-FCN model, it is necessary to investigate how the model gets the predictions from the input data. Meanwhile, whether each component of the model is working as we expected. To realize these purposes, we first sample one input data from our testing data set. As shown in Figure 7, the sampled input data include two parts, the recent term and the period term.

[figure(s) omitted; refer to PDF]

Then, we feed the sample data into the LC-ST-FCN and other models, including the LC-FCN, FCN, and CNN, in its control group. The comparison of their model structures is shown in Table 2. According to the type of layers, we divide each model into three blocks. To find out how different models transform the input data to obtain their prediction results, we visualize the activation values of their hidden layers.

Table 2

Comparison of model structures.

Model	Block 1 (1–4 layers)	Block 2 (5–8 layers)	Block 3 (9-10 layers)	FCN architecture
LC-ST-FCN	3D convolution	2D convolution	Locally connected	Yes
LC-FCN	2D convolution	2D convolution	Locally connected	Yes
FCN	2D convolution	2D convolution	2D convolution	Yes
CNN	2D convolution	2D convolution	Fully connected	No

Given the sample data, region 8-8 (grid number) is maximally activated by the output layer (Block 3) of all these four models. Taking region 8-8 as a target, we can correspondingly find out which parts of the input data are utilized by each model to obtain the final prediction of the target region.

The results reported in Figure 8 correspond to the hidden layers of each model listed in Table 2. For visualization and comparison purposes, we normalize the activation value between 0 and 1. Each heatmap image is generated by the activation maps of the corresponding hidden layer, and the brighter colors represent higher activation values. Besides each heatmap image, the scatter diagram is used to compare the spatial correlations detected by the Pearson correlation analysis and each network structure. In each scatter diagram, the horizontal axis represents the pairwise Pearson correlation between the target region and the others (including itself), and the vertical axis represents the activation value of each region.

[figure(s) omitted; refer to PDF]

From the results of Figure 8, we can observe as follows:

(1) From the heatmap images of Block 2 (the third column in Figure 8), we can find that the regions localized by the CNN model are different from those by the three other models. The CNN model activated more regions than the others (see the highlighted areas). According to the corresponding scatter diagram, many regions that have a weak linear correlation with the target region are also activated by the CNN model. Different from the CNN model, the regions activated by FCN-based models are significantly related to the Pearson correlation analysis results. Intuitively, the predictive performance will suffer if the model pays too much attention to weakly related regions, and the experimental results also confirm this view (shown in Table 1). Moreover, the relationships between the two variables shown in the scatter diagrams are not linear, indicating that the neural networks can capture not only the linear but also the nonlinear relationships among regions.

In Table 2, except for the CNN model, the other three models are all fully convolutional. Therefore, we can conclude that it is necessary to maintain spatial coordinates when we aim for detailed predictions in all regions of a city, and the fully convolutional architecture is very effective.

(2) In the first column of Figure 8, we can observe that the regions concerned by each model are almost the same. However, after 4 layers, the output of Block 1 (the second column of Figure 8) shows a great difference, where the output of the LC-ST-FCN model in this block is obtained by the 3D convolutional layer and the others are obtained by the 2D convolutional layer. From the comparison of the diagrams in the second column, it is interesting to observe that the two variables described in the diagram of the LC-ST-FCN model are still highly related to each other, but in the other three diagrams, the relationship between these two variables becomes hard to detect. The very reason for the situation mentioned above lies in that the 3D convolution layers and 2D convolutional layers used in this paper adopt different strategies to extract spatial-temporal features. The spatial correlation between regions is defined by the relationship between their time-series data, but the temporal information is collapsed after 2D convolutional layers. Consequently, the relationship between regions is also lost. Moreover, we also try another variant of the LC-ST-FCN by changing the order of 3D layers and 2D layers. As a result, the RMSE increases from 1.66 to 1.69. Thus, we can infer that the 3D convolutional layers have the ability to preserve temporal information better.

(3) To deal with the local statistical differences among regions, besides the features that are important for all the regions, the model should have the ability to capture features that are only useful for limited regions. More specifically, the objective function of the model needs to pay attention to region-wise errors. By putting the locally connected convolutional layers at the end of the model, the prediction value of each region is obtained by a different set of parameters, and the prediction error of each region can be delivered to the previous layers independently during the backpropagation training process. Thus, the use of locally connected convolutional layers can enable the LC-ST-FCN model to capture the statistics of different regions by enforcing the model to focus on the prediction errors of each region.

In the third column of Figure 8, we can see that the scatter diagrams of the first three models are very similar, which means that the useful regions concerned by these three models are almost the same. However, based on the activation values output by the last layer of Block 2 of these three models (without normalization), we can find that the models with locally connected convolutional layers can activate these useful regions more effectively (as shown in Figure 9). Therefore, we can conclude that locally connected layers play an important role in dealing with local statistical differences and activating useful regions.

[figure(s) omitted; refer to PDF]

6. Conclusions

In this paper, we proposed a fusion convolutional architecture named LC-ST-FCN to model the distinct features of the ride-hailing demand forecasting problem, including spatial-temporal dependencies, spatial coordinates, and local statistical differences. Besides the higher prediction accuracy, we reveal the working mechanism of the proposed model on the taxi demand data. The advantages and disadvantages of various deep learning architectures in addressing the challenges of the demand forecasting problem, including the capturing of spatiotemporal dependencies, learning of local statistical differences, and ability to localize most related regions, are systematically evaluated. A real dataset from the DiDiChuxing platform is used for model evaluation and comparison. In the experiments, our model outperforms all the benchmark models in terms of three widely adopted metrics. In this paper, the training data of our proposed model are generated by the grid-based partition, which is carried out independently of the model training. In the future work, we expect to explore a learnable partition algorithm that has the ability of automatic region partition by learning the demand statistics of ride-hailing service platforms.

Acknowledgments

The work described in this paper was supported by the National Natural Science Foundation of China (71861167001, 72001152) and the Fundamental Research Funds for the Central Universities (JBK2103009). Furthermore, the authors would like to express their great appreciation to Pro. Lu Li for her valuable and constructive suggestions during the planning and development of this work.

References

[1] L. Hu, Y. Liu, "Joint design of parking capacities and fleet size for one-way station-based carsharing systems with road congestion constraints," Transportation Research Part B: Methodological, vol. 93, pp. 268-299, DOI: 10.1016/j.trb.2016.07.021, 2016.

[2] Y. Liu, Y. Li, "Pricing scheme design of ridesharing program in morning commute problem," Transportation Research Part C: Emerging Technologies, vol. 79, pp. 156-177, DOI: 10.1016/j.trc.2017.02.020, 2017.

[3] M. M. Vazifeh, P. Santi, G. Resta, S. H. Strogatz, C. Ratti, "Addressing the minimum fleet problem in on-demand urban mobility," Nature, vol. 557 no. 7706, pp. 534-538, DOI: 10.1038/s41586-018-0095-1, 2018.

[4] Z. Deng, M. Ji, "Spatiotemporal structure of taxi services in Shanghai: using exploratory spatial data analysis," Proceedings of the International Conference on Geoinformatics, .

[5] L. Moreira-Matias, J. Gama, M. Ferreira, J. Mendes-Moreira, L. Damas, "Predicting taxi–passenger demand using streaming data," IEEE Transactions on Intelligent Transportation Systems, vol. 14 no. 3, pp. 1393-1402, DOI: 10.1109/tits.2013.2262376, 2013.

[6] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, W. C. Woo, "Convolutional LSTM network: a machine learning approach for precipitation nowcasting," Proceedings of the International Conference on Neural Information Processing Systems, pp. 802-810, .

[7] J. Zhang, Y. Zheng, D. Qi, "Deep spatio-temporal residual networks for citywide crowd flows prediction," 2016.

[8] J. Ke, H. Zheng, H. Yang, X. M. Chen, "Short-term forecasting of passenger demand under on-demand ride services: a spatio-temporal deep learning approach," Transportation Research Part C: Emerging Technologies, vol. 85, pp. 591-608, DOI: 10.1016/j.trc.2017.10.016, 2017.

[9] X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang, Y. Wang, "Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction," Sensors, vol. 17 no. 4,DOI: 10.3390/s17040818, 2017.

[10] R. Yu, Y. Li, C. Shahabi, U. Demiryurek, Y. Liu, "Deep learning: a generic approach for extreme condition traffic forecasting," Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 777-785, .

[11] K. Zhao, D. Khryashchev, J. Freire, C. Silva, H. Vo, "Predicting taxi demand at high spatial resolution: approaching the limit of predictability," IEEE International Conference on Big Data, pp. 833-842, .

[12] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, Z. Li, "Deep multi-view spatial-temporal network for taxi demand prediction," Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 2588-2595, .

[13] S. Hochreiter, J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9 no. 8, pp. 1735-1780, DOI: 10.1162/neco.1997.9.8.1735, 1997.

[14] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neurawol netrks," International Conference on Neural Information Processing Systems, vol. 60, pp. 1097-1105, 2012.

[15] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 32 no. 1,DOI: 10.1109/tnnls.2020.3027426, 2020.

[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, F. F. Li, "Large-Scale video classification with convolutional neural networks," Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725-1732, .

[17] K. Hara, H. Kataoka, Y. Satoh, "Learning spatio-temporal features with 3D residual networks for action recognition," Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 3154-3160, .

[18] S. Bai, J. Z. Kolter, V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," 2018.

[19] S. Liao, L. Zhou, D. Xuan, Y. Bo, J. Xiong, "Large-scale short-term urban taxi demand forecasting using deep learning," Proceedings of the 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 428-433, .

[20] A. Stathopoulos, M. G. Karlaftis, "A multivariate state space approach for urban traffic flow modeling and prediction," Transportation Research Part C: Emerging Technologies, vol. 11 no. 2, pp. 121-135, DOI: 10.1016/s0968-090x(03)00004-4, 2003.

[21] N. Zhang, Y. Zhang, H. Lu, "Seasonal autoregressive integrated moving average and support vector machine models: prediction of short-term traffic flow on freeways," Transportation Research Record, vol. 2215 no. 1, pp. 85-92, DOI: 10.3141/2215-09, 2011.

[22] Z. Zhu, B. Peng, C. Xiong, L. Zhang, "Short‐term traffic flow prediction with linear conditional Gaussian Bayesian network," Journal of Advanced Transportation, vol. 50 no. 6, pp. 1111-1123, DOI: 10.1002/atr.1392, 2016.

[23] W. Huang, G. Song, H. Hong, K. Xie, "Deep architecture for traffic flow prediction: deep belief networks with multitask learning," IEEE Transactions on Intelligent Transportation Systems, vol. 15 no. 5, pp. 2191-2201, DOI: 10.1109/tits.2014.2311123, 2014.

[24] X. Ma, H. Yu, Y. Wang, Y. Wang, "Large-Scale transportation network congestion evolution prediction using deep learning theory," PLoS One, vol. 10 no. 3,DOI: 10.1371/journal.pone.0119044, 2015.

[25] X. M. Chen, M. Zahiri, S. Zhang, "Understanding ridesplitting behavior of on-demand ride services: an ensemble learning approach," Transportation Research Part C: Emerging Technologies, vol. 76, pp. 51-70, DOI: 10.1016/j.trc.2016.12.018, 2017.

[26] Z. Cui, K. Henrickson, R. Ke, Y. Wang, "Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting," 2018.

[27] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, Y. Liu, "Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting," vol. 33 no. 1, pp. 3656-3663, DOI: 10.1609/aaai.v33i01.33013656, .

[28] B. Yu, M. Li, J. Zhang, Z. Zhu, "3D graph convolutional networks with temporal graphs: a spatial information free framework for traffic forecasting," 2019.

[29] J. Ke, S. Feng, Z. Zhu, H. Yang, J. Ye, "Joint predictions of multi-modal ride-hailing demands: a deep multi-task multi-graph learning-based approach," Transportation Research Part C: Emerging Technologies, vol. 127,DOI: 10.1016/j.trc.2021.103063, 2021.

[30] S. Ji, W. Xu, M. Yang, K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35 no. 1, pp. 221-231, DOI: 10.1109/tpami.2012.59, 2013.

[31] T. Du, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," Proceedings of the IEEE International Conference on Computer Vision, .

[32] G. B. Huang, H. Lee, E. Learned-Miller, "Learning hierarchical representations for face verification with convolutional deep belief networks," Computer Vision & Pattern Recognition, .

[33] Y. Taigman, Y. Ming, M. A. Ranzato, L. Wolf, "DeepFace: closing the gap to human-level performance in face verification," Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, .

[34] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014.

[35] P. J. Brockwell, R. A. Davis, Time Series: Theory and Methods, 2015.

Word count: 6606

Show less

Copyright © 2023 Dapeng Zhang et al. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Accurate and reliable taxi demand prediction is of great importance for intelligent planning and management in the transportation system. To collectively forecast the taxi demand in all regions of a city, many existing studies focus on the capturing of spatial and temporal correlations among regions but ignore the local statistical differences throughout the geographical layout of a city. This limits the further improvement of prediction accuracy. In this paper, we propose a new deep learning framework, called the locally connected spatial-temporal fully convolutional neural network ( LC-ST-FCN), to learn the spatial-temporal correlations and local statistical differences among regions simultaneously. We evaluate the proposed model on a real dataset from a ride-hailing service platform (DiDi Chuxing) and observe significant improvements compared with a bunch of baseline models. Besides, we further explore the working mechanism of the proposed model by visualizing its feature extraction processes. The visualization results showed that our approach can better localize and capture useful features from spatial-related regions.

Details

Title

Learning Spatial-Temporal Features of Ride-Hailing Services with Fusion Convolutional Networks

Author

Zhang, Dapeng¹

; Xiao, Feng¹

; Kou, Gang¹

; Luo, Jian²; Yang, Fan²

¹ School of Business Administration, Faculty of Business Administration, Southwestern University of Finance and Economics, Chengdu, China
² Chengdu Transportation Operation Coordination Center, Chengdu, China

Editor

Tomio Miwa

Publication year

2023

Publication date

2023

Publisher

John Wiley & Sons, Inc.

ISSN

01976729

e-ISSN

20423195

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2023/4427638

ProQuest document ID

2770535543

Learning Spatial-Temporal Features of Ride-Hailing Services with Fusion Convolutional Networks

Jump to:

Full text

Abstract

Details

Suggested sources