1 Introduction
Multivariate spatio-temporal data are ubiquitous in our society. In finance and economics, stock prices and economic indicators are tracked over time; in logistics, supermarkets collect product-level data to decide the optimal stock levels for different stores; in meteorology, weather stations record climate variables to monitor climate change and its impact on agriculture, human health, and natural disasters. Spatio-temporal data are observations recorded with time and geographic location information. Multivariate means that multiple variables (e.g. temperature, precipitation, wind speed and direction) are recorded. Analytical methods commonly focus on each aspect separately: time series analysis addresses temporal trends, spatial analysis examines geographic patterns, and multivariate analysis models the relationships between the measured variables. However, all three aspects should ideally be considered together to tackle contemporary problems. Monitoring droughts, for example, requires historical time-stamped data to understand “normal” conditions for any spatial neighborhood, as well as the interactions of precipitation, temperature and other variables. From multivariate spatio-temporal data, decision-makers compute indexes constructed from these components to inform the public and to determine whether and when to take remedial action.
This research addresses the challenge of investigating the components of multivariate spatio-temporal data both together and independently, by providing new tools for organising, visualising and explaining relationships. The illustration in Figure 1.1 shows how the three research topics are related and the solutions they provide: easy ways to pivot between the three components, the ability to focus on the multivariate or the spatio-temporal components of the data, and a new data pipeline for constructing indexes that monitor different aspects of our world using multiple variables. When time is fixed, the data are reduced to their spatial and multivariate elements. When the spatial component represents observations, the data can be analysed using multivariate methods such as dimension reduction; the particular dimension reduction technique investigated in this thesis is projection pursuit. When the data are collected at different locations in space, software from geo-informatics can be useful for analysing the spatial aspect of the data. However, existing software for spatial and temporal data analysis is built upon different data formats. To combine spatial and temporal data for spatio-temporal analysis, the spatial data need to be duplicated at each time point, and these duplicates can make spatial analysis inefficient. The result is a constant need to combine and separate the two components to align with the existing software, which creates friction in the data analysis. Multivariate spatio-temporal variables are often combined into a single series for each location to produce an index series, which can be used for decision making or communicating conditions. But index definition and construction differ vastly across fields and researchers, making it difficult to understand how indexes might perform with slight changes in the formula, how they are affected by data quality issues, and how competing indexes compare.
1.1 Visual diagnostics for projection pursuit optimization
Many datasets, despite having different multivariate, spatial, and temporal features, can all be categorised as multivariate spatio-temporal data. When the data contain only a few time points, we may treat each time point as independent and apply multivariate methods to analyse the relationships among the variables. Bivariate relationships, either linear or non-linear, can be represented in a scatterplot matrix; however, this becomes complex when a relationship involves three or more variables. A dimension reduction technique called “projection pursuit” can be used to find interesting structures in multivariate data by linear projection. Combined with a visualisation technique called the guided tour, it can show the data points transitioning smoothly from randomness to the interesting structures found by the algorithm. In projection pursuit, multivariate data are transformed into a low-dimensional space, typically 1D or 2D, using a projection matrix. Each projection matrix corresponds to a projection of the data, on which statistics, also called index functions, can be computed. The projection pursuit algorithm optimises the index function over the set of orthonormal matrices to detect interesting patterns in the data, such as clusters or outliers. In practice, however, the optimiser does not always work as desired: it may fail unexpectedly, get stuck at a local maximum, or approach the maximum without reaching it. In this work, four diagnostic plots are proposed to track the optimisation algorithms in projection pursuit.
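To make the idea of optimising an index function over projections concrete, the following is a minimal sketch in Python, assuming a 1D projection, an absolute excess kurtosis index, and a simple random search. None of these choices is taken from the thesis: the function names (`kurtosis_index`, `random_search`), the optimiser, and the simulated data are all hypothetical and purely illustrative. The returned `trace` records the best index value at each iteration, which is the kind of quantity a diagnostic plot of the optimiser could display.

```python
import math
import random


def normalise(v):
    """Scale a vector to unit length so it is a valid 1D projection basis."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(data, basis):
    """Project each p-dimensional row onto the 1D basis."""
    return [sum(x * b for x, b in zip(row, basis)) for row in data]


def kurtosis_index(values):
    """Illustrative index function: absolute excess kurtosis of the projection."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return abs(fourth / var ** 2 - 3.0)


def random_search(data, n_iter=200, step=0.3, seed=1):
    """Hypothetical optimiser: perturb the basis, keep the candidate if it improves."""
    rng = random.Random(seed)
    p = len(data[0])
    best = normalise([rng.gauss(0, 1) for _ in range(p)])
    best_val = kurtosis_index(project(data, best))
    trace = [best_val]
    for _ in range(n_iter):
        candidate = normalise([b + step * rng.gauss(0, 1) for b in best])
        val = kurtosis_index(project(data, candidate))
        if val > best_val:
            best, best_val = candidate, val
        trace.append(best_val)  # best index value so far, for diagnostics
    return best, trace


# Simulated data (made up): two clusters along the first variable, noise elsewhere.
data_rng = random.Random(0)
data = [
    [data_rng.choice([-2.0, 2.0]) + data_rng.gauss(0, 0.3),
     data_rng.gauss(0, 1),
     data_rng.gauss(0, 1)]
    for _ in range(300)
]
basis, trace = random_search(data)
```

Because the search only accepts improvements, the trace is non-decreasing; a long flat stretch in it suggests the optimiser has stalled, either at a local maximum or near the maximum without reaching it.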
1.2 Cubble: A new spatio-temporal data structure
When multivariate spatio-temporal data contain only a collection of variables, the spatial and temporal dimensions remain to be explored. Weather station data are one such example: the number of variables recorded depends on the instruments installed, while stations are widely distributed in space and daily data are available over many years. Spatial and temporal data analysis each provide tools for examining one dimension of the data. When working with spatio-temporal data, however, researchers may need to switch their data among different forms (purely spatial, purely temporal, or a combined table) to carry out the analysis. This presents the distinct task of coordinating data and results across formats, which does not arise when the data all have a single observational unit. While the reshaping itself may not be difficult for a given audience, the repeated need to reorganise the data disrupts the workflow, forcing researchers to pause the actual data analysis and turn to transforming between data formats. This research addresses the problem by proposing a new data structure to organise spatio-temporal data in R so that the spatial and temporal information can be easily accessed for exploratory data analysis.
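The switching between forms described above can be sketched with plain Python dictionaries. This is only an illustration under stated assumptions, not the data structure or API proposed in the thesis: the field names (`id`, `long`, `lat`, `ts`), the helper functions `to_long` and `to_nested`, and all values are made up. The nested form keeps one row per station with its time series attached, which suits spatial operations; the long form has one row per station and date, which suits temporal operations.

```python
# Nested form: one record per station, time series stored inside (made-up values).
stations = [
    {"id": "A", "long": 144.9, "lat": -37.8,
     "ts": [{"date": "2020-01-01", "tmax": 25.0},
            {"date": "2020-01-02", "tmax": 31.5}]},
    {"id": "B", "long": 151.2, "lat": -33.9,
     "ts": [{"date": "2020-01-01", "tmax": 28.0},
            {"date": "2020-01-02", "tmax": 27.0}]},
]


def to_long(nested):
    """One row per (station, date); spatial fields are left behind, keyed by id."""
    return [
        {"id": station["id"], **obs}
        for station in nested
        for obs in station["ts"]
    ]


def to_nested(long_rows, spatial):
    """Re-attach each station's observations to its single spatial record."""
    nested = []
    for site in spatial:
        series = [
            {k: v for k, v in row.items() if k != "id"}
            for row in long_rows
            if row["id"] == site["id"]
        ]
        nested.append({**site, "ts": series})
    return nested


# Spatial-only view: one row per station, no duplicated coordinates.
spatial = [{k: v for k, v in s.items() if k != "ts"} for s in stations]

long_rows = to_long(stations)                 # temporal view
round_trip = to_nested(long_rows, spatial)    # back to the nested view
```

Keeping the station identifier as the link between the two forms means coordinates are stored once rather than repeated at every time point, while either view can be recovered from the other.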
1.3 A tidy framework for indexes
Multivariate spatio-temporal data can also be analysed as multivariate time series with fixed observations. To visualise and explain this collection of multivariate time series, indexes can be constructed to monitor the joint effect of multiple variables over time or to compare information from different observations. Examples of such indexes can be found in monitoring the environment (e.g. drought indexes and water quality indexes), measuring social development (e.g. the human development index and the gender equality index), and making decisions on resource allocation. While research institutes and governments publish index values calculated according to standard practice, information should also be made available on how an index may behave under different data conditions and what this implies for decision making. In this work, we develop a general data pipeline to construct spatio-temporal indexes from multivariate data. This provides researchers with a standard framework for constructing and analysing indexes, including experimenting with different parameter choices, adjusting steps in the index definition, calculating uncertainty, and assessing index robustness. The design of the pipeline is aligned with the tidy framework adopted by the tidyverse and tidymodels, allowing indexes to be constructed and analysed in a unified syntax, regardless of their application domain.
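The notion of an index as a pipeline of interchangeable steps can be illustrated with a small Python sketch. It assumes two hypothetical steps, min-max rescaling and a weighted sum, and made-up data; the step names (`rescale_minmax`, `weighted_sum`), the variables, and the weights are illustrative and do not correspond to the interface of the package developed in this thesis or to any published index.

```python
def rescale_minmax(var):
    """Step factory: rescale one variable to the range [0, 1]."""
    def step(rows):
        values = [r[var] for r in rows]
        lo, hi = min(values), max(values)
        return [{**r, var: (r[var] - lo) / (hi - lo)} for r in rows]
    return step


def weighted_sum(weights, out="index"):
    """Step factory: combine variables into a single index column."""
    def step(rows):
        return [
            {**r, out: sum(w * r[v] for v, w in weights.items())}
            for r in rows
        ]
    return step


def run_pipeline(rows, steps):
    """Apply each step in order; every step maps a table to a new table."""
    for step in steps:
        rows = step(rows)
    return rows


# Made-up series for one location.
rows = [
    {"t": 1, "prcp": 0.0, "tavg": 30.0},
    {"t": 2, "prcp": 5.0, "tavg": 20.0},
    {"t": 3, "prcp": 10.0, "tavg": 10.0},
]
rescale = [rescale_minmax("prcp"), rescale_minmax("tavg")]

# Same pipeline, two weighting choices, to see how the index responds.
equal = run_pipeline(rows, rescale + [weighted_sum({"prcp": 0.5, "tavg": 0.5})])
skewed = run_pipeline(rows, rescale + [weighted_sum({"prcp": 0.8, "tavg": 0.2})])
```

Because each step is a separate function, changing a parameter or swapping a step only requires editing the list of steps, which makes comparisons such as `equal` against `skewed` straightforward.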
1.4 Thesis overview
The rest of the thesis is organised as follows. Chapter 2 presents the proposed visual diagnostic plots designed to assess the optimisation in the projection pursuit guided tour, along with the R implementation, ferrn (Zhang et al. 2021). In Chapter 3, a novel data structure and the R package, cubble (Zhang et al. 2023a), are introduced to organise spatio-temporal data, with examples given to demonstrate their use in analysing weather station data. Chapter 4 proposes a framework for constructing spatio-temporal indexes from data; the resulting data pipeline is implemented in the package tidyindex (Zhang et al. 2023b). Chapter 5 concludes the thesis and discusses potential future directions.