# Inferential Challenges for Large Spatio-Temporal Data Structures (17w5153)

Arriving in Banff, Alberta Sunday, December 3 and departing Friday December 8, 2017

## Organizers

(Western University)

Jamie Stafford (University of Toronto)

(University of British Columbia)

(University of Toronto)

## Objectives

A range of inference methodologies is in use for spatio-temporal data, and the complexity and nature of each problem dictate the methodology used. The high dimensionality of spatio-temporal data often means that inference methodologies combine model simplifications, approximations of estimators and likelihood functions, and computational and algorithmic efficiencies. Unlike many other ‘big data’ problems, however, an understanding of the physical properties of the spatio-temporal process in question often supports simplifying assumptions that lead to convenient and enabling mathematical properties (e.g., the Markov property, stationarity, and various forms of conditional independence).

The workshop will comprise talks in the four broad areas below, with a common emphasis on computational tractability and the accommodation of large, high-resolution spatio-temporal datasets. Interactions with subject-area specialists in some of the application areas will also feature prominently, through presentations as well as panel discussions and roundtables.

One class of spatio-temporal problems on which the workshop will focus involves making inference on a latent process $\lambda(s,t)$ and model parameters $\theta$ given a set of observations $Y_i$, each consisting of a response variable, covariates, a spatial location, and a time. A moderately high-resolution problem might involve making inference on $\lambda(s,t)$ at locations on a 100 by 100 grid and 10 different time points. This problem would involve latent variables of dimension 100,000 and potentially a covariance matrix with 10$^{10}$ entries. The methods described below can accommodate problems of roughly this order of magnitude; identifying the refinements to models, inference methodologies, and algorithms needed to accommodate resolutions on the order of 600 by 800 by 100 will be a common theme throughout the workshop.
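The arithmetic behind these problem sizes can be checked directly; a minimal sketch in NumPy, using the grid dimensions from the example above:

```python
import numpy as np

# Grid from the example above: a 100 x 100 spatial grid at 10 time points.
nx, ny, nt = 100, 100, 10

n_latent = nx * ny * nt   # dimension of the latent field: 100,000
n_cov = n_latent ** 2     # entries in a dense covariance matrix: 10^10

# At 8 bytes per double, a dense covariance matrix needs about 80 GB,
# which is why sparse, low-rank, or matrix-free representations matter.
dense_gb = n_cov * 8 / 1e9
print(n_latent, n_cov, dense_gb)  # 100000 10000000000 80.0
```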
- **Markov chain Monte Carlo:** These methods can be extremely computationally intensive, particularly for high-dimensional problems where constructing efficient proposal distributions is difficult. MCMC methods will always be indispensable for providing a gold standard against which simpler methods can be evaluated, and as a last resort for fitting models where suitable approximate methods are unavailable. Inference via MCMC for latent Gaussian spatial point process data (of which disease incidence data is one example) has been undertaken for moderately high-resolution spatio-temporal problems. Statistical efficiency gains may be achievable if bivariate or multivariate outcomes, each linked to correlated latent random fields, are jointly modelled; for example, one could envision asthma and lung cancer outcomes being modelled in space and time as they relate to environmental covariates, measured in space and time at varying levels of precision and granularity. Further exploitation of recent advances in MCMC methodology, such as Hamiltonian Monte Carlo and particle MCMC, will improve these algorithms and the dimensionality of the problems that can be accommodated.
- **Asymptotic approximations:** Integrated nested Laplace approximations (INLA) exploit a higher order asymptotic (HOA) technique in which separate Laplace approximations for the numerator and denominator of a posterior integral yield greater accuracy than a single Laplace approximation of the posterior density alone. Building INLA-like algorithms around the large-scale use of alternative HOA techniques is a research avenue that is likely to be fruitful. For example, the r$^*$ formula for computing highly accurate tail probabilities may yield pointwise exceedance surfaces, as well as providing computational advantages over INLA, which relies on numerical integration for such computations. The parallelizable nature of many HOA methods, in comparison with the inherently iterative nature of MCMC algorithms, points to asymptotic inference methods gaining in importance. An avenue to be explored is whether these computational approaches can be adapted to handle joint modelling problems.
- **Kernel smoothing:** Kernel smoothing techniques are the most straightforward and (at least in their most basic form) least computationally intensive techniques for spatio-temporal data. Kernel smoothing can be extended into a local-likelihood algorithm to accommodate covariates, offsets, and aggregated data. The local-likelihood framework can also be exploited in situations where the underlying data-generating mechanism follows, or approximately follows, the solution of a differential equation. Other qualitative constraints, such as monotonicity and convexity, can also be exploited. Efficient implementations of bootstrap techniques, involving a minimum of recomputation for each bootstrap sample, will enable uncertainty quantification, keeping in mind the possibility of distortions due to bias. Developing alternatives to computationally intensive cross-validation methods for bandwidth selection is a pressing concern.
- **Model simplification:** One way of simplifying a spatio-temporal model to overcome the problem of high dimensionality is a ‘low-rank’ approach using a manageable number of spatio-temporal basis functions. Examples include Fourier basis functions (and possibly inference in the frequency domain) and spatio-temporal polynomial splines. A second approach is the use of Markov random fields (MRFs), where conditional independence produces a sparse precision matrix, with the dependence structure in the MRF chosen to mimic a continuously varying spatio-temporal process. Extending MRF models to new classes of problems (non-stationarity, for example) is an active area of research. A further possibility is matrix-free estimating equations, which reduce memory usage to the point where large-scale parallelization is possible. How this methodology enters the modelling of cluster point processes, with application to environmental processes such as storms and wildfire occurrence, will also be considered.

Developing models that reflect the richness of spatio-temporal dependence structures while remaining computationally feasible and efficient is a common concern for applied statisticians in each of these areas. The dominant theme in statistical methods development for spatial epidemiology is making inference on disease risk in space and/or time from observed case incidence data. Spatial information on incident cases is typically aggregated or censored to regions having varying degrees of precision and regularity, while temporal information is generally well defined and regular (aggregated daily or yearly). Certain applications in ecology, such as the modelling of diseased trees and associated infestations, involve similarly aggregated data. In either context, these data are frequently zero-heavy, which adds model complexity along with the associated difficulties in parameter estimation. Similar complexities arise in surveys of fish stocks collected over oceanic regions.
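The sparse precision structure that makes MRF approaches tractable can be made concrete. The following is a hypothetical sketch, not a method endorsed by the workshop: a first-order conditional autoregressive (CAR) precision matrix on a 100 by 100 grid, built with SciPy sparse matrices (the CAR form, parameter values, and all names here are our own, for illustration):

```python
import numpy as np
import scipy.sparse as sp

def car_precision(nx, ny, tau=1.0, rho=0.99):
    """Sparse precision matrix Q = tau * (D - rho * W) for a first-order
    CAR model on an nx-by-ny grid, where W is the rook-neighbour adjacency
    matrix and D is the diagonal matrix of neighbour counts."""
    n = nx * ny
    idx = np.arange(n).reshape(nx, ny)
    rows, cols = [], []
    # Horizontal neighbour pairs (i, j) -- (i, j+1)
    rows += list(idx[:, :-1].ravel()); cols += list(idx[:, 1:].ravel())
    # Vertical neighbour pairs (i, j) -- (i+1, j)
    rows += list(idx[:-1, :].ravel()); cols += list(idx[1:, :].ravel())
    W = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    W = W + W.T  # symmetric adjacency
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    return (tau * (D - rho * W)).tocsc()

Q = car_precision(100, 100)
# Roughly 5 nonzeros per row, versus 10,000 per row for a dense covariance.
print(Q.shape, Q.nnz)
```

The point of the sketch is the storage contrast: the 10,000-dimensional field needs only about 50,000 stored values in precision form, and sparse Cholesky factorizations of such matrices scale far better than dense operations on the corresponding covariance.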
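As an entirely illustrative instance of kernel smoothing in its most basic form, a Gaussian kernel intensity estimate for spatial point data can be written in a few lines of NumPy; the function name, the fixed bandwidth, and the simulated data are our own choices, and bandwidth selection (the concern raised above) is deliberately left out:

```python
import numpy as np

def kernel_intensity(points, grid, bandwidth):
    """Gaussian kernel estimate of a spatial intensity surface,
    lambda_hat(s) = sum_i K_h(s - x_i), evaluated at each grid point.
    points: (n, 2) observed locations; grid: (m, 2) evaluation points."""
    # Squared distances between every grid point and every observation
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    # Bivariate Gaussian kernel with a common bandwidth in both coordinates
    K = np.exp(-0.5 * d2 / bandwidth**2) / (2 * np.pi * bandwidth**2)
    return K.sum(axis=1)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(200, 2))          # simulated point pattern
gx, gy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
grid = np.column_stack([gx.ravel(), gy.ravel()])
lam = kernel_intensity(pts, grid, bandwidth=0.1)     # intensity on the grid
```

Even this naive implementation is a single vectorized pass over the data, which is why kernel methods sit at the computationally cheap end of the spectrum; the cost grows as (grid points) × (observations) rather than with any latent dimension.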
Trends in time are particularly important in assessing the vulnerability of fisheries. Here data are again aggregated in space and time, but may also be missing, particularly in regions where no fishing occurs.

Graduate students and postdoctoral fellows may be involved in any aspect of the workshop proceedings as appropriate; they are often the most familiar with emerging datasets, new computational methodologies, and so on. Those with sufficient maturity may lead formal discussions or be invited to speak on their dissertation or current research. Senior researchers attending the workshop will be asked to identify appropriate students, and funds will be available for their expenses. At the very least, students will be invited to present posters in the common area, and the program may incorporate this in a formal way, say as a catered lunch event.