Commit b34e8436 authored by Chris Jewell

Merge branch 'spatial-effect' into 'master'

Spatial effect

See merge request !43
parents b72aeece c939ddec
@@ -7,86 +7,12 @@ information. Details of the model implementation may be found in `doc/lancs_spa
## Workflow
The entire analysis chain runs from case data, through parameter inference and predictive
simulation, to results summarised as long-format XLSX and geopackage documents.
The pipeline is run using the [`ruffus`](http://ruffus.org.uk) computational pipeline library.
The package relies heavily on the [TensorFlow](https://tensorflow.org) and
[TensorFlow Probability](https://tensorflow.org/probability) machine learning libraries, and is
optimised for GPU acceleration (tested on NVIDIA K40 and V100 cards). This package also imports
an experimental library, [`gemlib`](http://fhm-chicas-code.lancs.ac.uk/GEM/gemlib), hosted at Lancaster University.
This library is under active development as part of the [Wellcome Trust](https://wellcome.ac.uk) `GEM` project.
The pipeline gathers data from [the official UK Government website for data and insights on Coronavirus](https://coronavirus.data.gov.uk), together with population and mobility data taken from the
[UK Office for National Statistics Open Geography Portal](https://geoportal.statistics.gov.uk).
### Quickstart
```bash
$ poetry install # Python dependencies
$ poetry run python -m covid.pipeline --config example_config.yaml --results-dir <output_dir>
```
The global pipeline configuration file `example_config.yaml` contains a section for each pipeline
stage that requires configuration; see the file itself for documentation.
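As a quick orientation, here is a hedged sketch of reading that configuration from Python. The key names below are referenced in the source code (`gather_data`, `run_mcmc`), but their exact nesting inside the YAML is defined by `example_config.yaml` itself and is simplified here:

```python
# Sketch only: key names appear in the source, but the YAML nesting is
# defined by example_config.yaml and is flattened here for illustration.
import yaml

with open("example_config.yaml") as f:
    config = yaml.safe_load(f)

geo_path = config["geopackage"]        # GIS polygons for the LADs
flows_path = config["commute_volume"]  # commute/traffic flow data
bursts = config["num_burst_samples"]   # MCMC draws written per burst
```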
### Model specification
The `covid.model_spec` module contains the model, specified as a `tensorflow_probability.distributions.JointDistributionNamed`
instance `Covid19UK`. The module also contains a model version number; constants such as the stoichiometry matrix
characterising the state transition model; an accompanying next generation matrix function; a function to assemble
model-specific data; and a function to initialise the censored event data explored by the MCMC inference algorithm.
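For readers unfamiliar with the pattern, here is a minimal sketch of how a `JointDistributionNamed` composes named random variables. The parameter names echo the source, but the distributions and the `seir` term are simplified placeholders, not the real `Covid19UK` model:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
DTYPE = tf.float64

toy_model = tfd.JointDistributionNamed(
    dict(
        gamma0=tfd.Normal(tf.constant(0.0, DTYPE), tf.constant(100.0, DTYPE)),
        sigma_space=tfd.HalfNormal(scale=tf.constant(0.1, DTYPE)),
        # A callable's argument names bind to previously declared variables,
        # so downstream terms condition on upstream draws.
        seir=lambda gamma0, sigma_space: tfd.Normal(gamma0, sigma_space),
    )
)

draw = toy_model.sample(seed=1)   # dict: one entry per named variable
logp = toy_model.log_prob(draw)   # joint log-density at the draw
```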
### Pipeline stages
Each pipeline stage loads its input from and saves its output to disc. This is inherent to the `ruffus` pipeline
architecture, and makes it possible to run individual stages of the pipeline manually, as well as to
introspect the data passed between stages.
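A minimal sketch of that stage pattern follows; the task bodies and file contents are illustrative, and only `pipeline_data.pkl` is a real pipeline artefact:

```python
import pickle
from ruffus import originate, pipeline_run, suffix, transform

@originate("pipeline_data.pkl")
def assemble_data(output_file):
    # First stage: create the on-disc artefact downstream stages consume.
    with open(output_file, "wb") as f:
        pickle.dump({"cases": [5, 8, 13]}, f)

@transform(assemble_data, suffix(".pkl"), ".summary.pkl")
def summarise(input_file, output_file):
    # Later stage: load the previous stage's output, save its own.
    with open(input_file, "rb") as f:
        data = pickle.load(f)
    with open(output_file, "wb") as f:
        pickle.dump({"num_days": len(data["cases"])}, f)

if __name__ == "__main__":
    pipeline_run([summarise])  # runs only stages whose outputs are stale
```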
1. Data assembly: `covid.tasks.assemble_data` downloads or loads data from various sources, clips
it to the desired date range where needed, and bundles it into a pickled Python dictionary `<output_dir>/pipeline_data.pkl`.
2. Inference: `covid.tasks.mcmc` runs the data augmentation MCMC algorithm described in the concept note, producing
a (large!) HDF5 file, `posterior.hd5`, containing draws from the joint posterior distribution.
3. Sample thinning: `covid.tasks.thin_posterior` further thins the posterior draws contained in the HDF5 file into a (much
smaller) pickled Python dictionary `<output_dir>/thin_samples.pkl`.
4. Next generation matrix: `covid.tasks.next_generation_matrix` computes the posterior next generation matrix for the
epidemic, from which measures of Local Authority District level and National-level reproduction number can be derived.
This posterior is saved in `<output_dir>/ngm.pkl`.
5. National Rt: `covid.tasks.overall_rt` evaluates the dominant eigenvalue of the next generation matrix samples using
power iteration and the Rayleigh quotient method (see the sketch after this list). The dominant eigenvalue of the inter-LAD next generation matrix gives the
national reproduction number estimate.
6. Prediction: `covid.tasks.predict` calculates the Bayesian predictive distribution of the epidemic given the observed
data and joint posterior distribution. This is used in two ways:
- in-sample predictions are made for the latest 7 and 14 day time intervals in the observed data time window. These
are saved as `<output_dir>/insample7.pkl` and `<output_dir>/insample14.pkl` `xarray` data structures.
- medium-term predictions are made by simulating forward 56 days from the day after the end of the observed data time window. These are saved as an `<output_dir>/medium_term.pkl` `xarray` data structure.
7. Summary output:
- LAD-level reproduction number: `covid.tasks.summarize.rt` takes the column sums of the next generation matrix as the
LAD-level reproduction number. This is saved in `<output_dir>/rt_summary.csv`.
- Incidence summary: `covid.tasks.summarize.infec_incidence` calculates mean and quantile information for the medium-term prediction, `<output_dir>/infec_incidence_summary.csv`.
- Prevalence summary: `covid.tasks.summarize.prevalence` calculates the predicted prevalence of COVID-19 infection
(model E+I compartments) at LAD level, `<output_dir>/prevalence_summary.csv`.
- Population attributable risk fraction for infection: `covid.tasks.within_between` calculates the population
attributable fraction of within-LAD versus between-LAD infection risk, `<output_dir>/within_between_summary.csv`.
- Case exceedance: `covid.tasks.case_exceedance` calculates the probability that observed cases in the last 7 and 14
days of the observed timeseries exceed the predictive distribution. This highlights regions that are behaving
atypically given the model, `<output_dir>/exceedance_summary.csv`.
8. In-sample predictive plots: `covid.tasks.insample_predictive_timeseries` plots graphs of the in-sample predictive
distribution for the last 7 and 14 days within the observed data time window, `<output_dir>/insample_plots7` and
`<output_dir>/insample_plots14`.
9. Geopackage summary: `covid.tasks.summary_geopackage` assembles summary information into a `geopackage` GIS file,
`<output_dir>/prediction.pkg`.
10. Long format summary: `covid.tasks.summary_longformat` assembles reproduction number, observed data, in-sample, and medium-term
predictive incidence and prevalence (per 100,000 people) into a long-format XLSX file.
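The eigenvalue computation referenced in stage 5 can be sketched as below. Here `ngm` stands in for a single posterior draw of the next generation matrix and is randomly generated purely for illustration:

```python
import numpy as np

def dominant_eigenvalue(ngm, num_iterations=100):
    """Power iteration with a Rayleigh-quotient read-out."""
    b = np.ones(ngm.shape[0])
    for _ in range(num_iterations):
        b = ngm @ b                 # repeatedly apply the matrix...
        b /= np.linalg.norm(b)      # ...renormalising each step
    return (b @ ngm @ b) / (b @ b)  # Rayleigh quotient at convergence

rng = np.random.default_rng(0)
ngm = rng.gamma(2.0, 0.05, size=(150, 150))  # toy positive matrix
print(dominant_eigenvalue(ngm))  # national-level R estimate (stage 5)
lad_rt = ngm.sum(axis=0)         # column sums: LAD-level R (stage 7)
```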
This repository contains code that produces Monte Carlo samples of the Bayesian posterior distribution
given the model and case timeseries data from [coronavirus.data.gov.uk](https://coronavirus.data.gov.uk),
implementing an ETL step, the model itself, and associated inference and prediction steps.
Users requiring an end-to-end pipeline implementation should refer to the [covid-pipeline](https://github.com/chrism0dwk/covid-pipeline)
repository.
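As a quick sanity check on a finished run, the posterior file can be inspected with `h5py`. The `results/hmc/is_accepted` and `time` dataset names appear in the inference code; the overall layout is best discovered by listing the file:

```python
import h5py

with h5py.File("posterior.hd5", "r") as f:
    f.visit(print)  # list every group/dataset path in the file
    print("HMC acceptance:", f["results/hmc/is_accepted"][:].mean())
    print("first dates:", f["time"][:5])
```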
## COVID-19 Lancaster University data statement
@@ -103,13 +29,6 @@ UTLA: Upper Tier Local Authority
LAD: Local Authority District
### Files
* `covid` Python package
* `example_config.yaml` example configuration file containing data paths and MCMC settings
* `data` a directory containing example data (see below)
* `pyproject.toml` a PEP 518-compliant file describing the `poetry` build system and dependencies.
## Example data files
* `data/c2019modagepop.csv` a file containing local authority population data in the UK, taken from ONS prediction for December 2019. Local authorities [City of Westminster, City of London] and [Cornwall, Isles of Scilly] have been aggregated to meet commute data processing requirements.
* `data/mergedflows.csv` inter local authority mobility matrix taken from UK Census 2011 commuting data and aggregated up from Middle Super Output Area level (respecting aggregated LADs as above).
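A hedged sketch of a first look at these files with `pandas`; no column names are assumed here, so inspect the headers directly:

```python
import pandas as pd

population = pd.read_csv("data/c2019modagepop.csv")
flows = pd.read_csv("data/mergedflows.csv")

print(population.head())  # LAD population structure
print(flows.head())       # inter-LAD commuting flows
```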
@@ -56,7 +56,7 @@ def _get_window_sizes(num_adaptation_steps):
return first_window_size, slow_window_size, last_window_size
@tf.function(jit_compile=False)
@tf.function(autograph=False, jit_compile=False)
def _fast_adapt_window(
num_draws,
joint_log_prob_fn,
@@ -114,7 +114,7 @@ def _fast_adapt_window(
return draws, trace, step_size, weighted_running_variance
@tf.function(jit_compile=False)
@tf.function(autograph=False, jit_compile=False)
def _slow_adapt_window(
num_draws,
joint_log_prob_fn,
@@ -183,7 +183,7 @@ def _slow_adapt_window(
)
@tf.function(jit_compile=False)
@tf.function(autograph=False, jit_compile=False)
def _fixed_window(
num_draws,
joint_log_prob_fn,
@@ -266,13 +266,19 @@ def trace_results_fn(_, results):
def draws_to_dict(draws):
num_locs = draws[1].shape[1]
num_times = draws[1].shape[2]
return {
"psi": draws[0][:, 0],
"beta_area": draws[0][:, 1],
"gamma0": draws[0][:, 2],
"gamma1": draws[0][:, 3],
"alpha_0": draws[0][:, 4],
"alpha_t": draws[0][:, 5:],
"sigma_space": draws[0][:, 1],
"beta_area": draws[0][:, 2],
"gamma0": draws[0][:, 3],
"gamma1": draws[0][:, 4],
"alpha_0": draws[0][:, 5],
"alpha_t": draws[0][:, 6 : (6 + num_times - 1)],
"spatial_effect": draws[0][
:, (6 + num_times - 1) : (6 + num_times - 1 + num_locs)
],
"seir": draws[1],
}
@@ -356,7 +362,8 @@ def run_mcmc(
current_state = [s[-1] for s in draws]
draws[0] = param_bijector.inverse(draws[0])
posterior.write_samples(
draws_to_dict(draws), first_dim_offset=offset,
draws_to_dict(draws),
first_dim_offset=offset,
)
posterior.write_results(trace, first_dim_offset=offset)
offset += first_window_size
@@ -388,7 +395,8 @@ def run_mcmc(
current_state = [s[-1] for s in draws]
draws[0] = param_bijector.inverse(draws[0])
posterior.write_samples(
draws_to_dict(draws), first_dim_offset=offset,
draws_to_dict(draws),
first_dim_offset=offset,
)
posterior.write_results(trace, first_dim_offset=offset)
offset += window_num_draws
@@ -408,7 +416,8 @@ def run_mcmc(
current_state = [s[-1] for s in draws]
draws[0] = param_bijector.inverse(draws[0])
posterior.write_samples(
draws_to_dict(draws), first_dim_offset=offset,
draws_to_dict(draws),
first_dim_offset=offset,
)
posterior.write_results(trace, first_dim_offset=offset)
offset += last_window_size
@@ -434,10 +443,12 @@ def run_mcmc(
current_state = [state_part[-1] for state_part in draws]
draws[0] = param_bijector.inverse(draws[0])
posterior.write_samples(
draws_to_dict(draws), first_dim_offset=offset,
draws_to_dict(draws),
first_dim_offset=offset,
)
posterior.write_results(
trace, first_dim_offset=offset,
trace,
first_dim_offset=offset,
)
offset += config["num_burst_samples"]
@@ -502,8 +513,9 @@ def mcmc(data_file, output_file, config, use_autograph=False, use_xla=True):
tfb.Softplus(low=dtype_util.eps(DTYPE)),
tfb.Identity(),
tfb.Identity(),
tfb.Identity(),
],
block_sizes=[1, 3, events.shape[1]],
block_sizes=[2, 4, events.shape[1] - 1, events.shape[0]],
)
)
@@ -512,11 +524,17 @@ def mcmc(data_file, output_file, config, use_autograph=False, use_xla=True):
return model.log_prob(
dict(
psi=params[0],
beta_area=params[1],
gamma0=params[2],
gamma1=params[3],
alpha_0=params[4],
alpha_t=params[5:],
sigma_space=params[1],
beta_area=params[2],
gamma0=params[3],
gamma1=params[4],
alpha_0=params[5],
alpha_t=params[6 : (6 + events.shape[1] - 1)],
spatial_effect=params[
(6 + events.shape[1] - 1) : (
6 + events.shape[1] - 1 + events.shape[0]
)
],
seir=events,
)
) + param_bij.inverse_log_det_jacobian(
@@ -530,8 +548,12 @@ def mcmc(data_file, output_file, config, use_autograph=False, use_xla=True):
current_chain_state = [
tf.concat(
[
np.array([0.1, 0.0, 0.0, 0.0], dtype=DTYPE),
np.full(events.shape[1], -1.75, dtype=DTYPE,),
np.array([0.0, 0.0, 0.0, 0.0, 0.0], dtype=DTYPE),
np.full(
events.shape[1] + events.shape[0],
0.0,
dtype=DTYPE,
),
],
axis=0,
),
@@ -553,7 +575,8 @@ def mcmc(data_file, output_file, config, use_autograph=False, use_xla=True):
)
posterior._file.create_dataset("initial_state", data=initial_state)
posterior._file.create_dataset(
"time", data=np.array(dates).astype(str).astype(h5py.string_dtype()),
"time",
data=np.array(dates).astype(str).astype(h5py.string_dtype()),
)
print(f"Acceptance theta: {posterior['results/hmc/is_accepted'][:].mean()}")
@@ -26,6 +26,26 @@ TIME_DELTA = 1.0
NU = tf.constant(0.28, dtype=DTYPE) # E->I rate assumed known.
def _compute_adjacency_matrix(geom, names, tol=200):
mat = geom.apply(lambda x: geom.distance(x) < tol).to_numpy()
np.fill_diagonal(mat, False)
# Fix for islands > tol apart
num_neighbours = mat.sum(axis=-1)
islands = np.where(num_neighbours == 0)[0]
closest_neighbour = [
geom.distance(geom.iloc[i]).argsort()[1] for i in islands
]
mat[islands, closest_neighbour] = True
mat = mat | mat.T # Ensure symmetry
return xarray.DataArray(
mat.astype(DTYPE), # Coerce to global float type
coords=[names, names],
dims=["location_dest", "location_src"],
)
def gather_data(config):
"""Loads covariate data
@@ -43,9 +63,12 @@ def gather_data(config):
commute_volume = read_traffic_flow(
config["commute_volume"], date_low=date_low, date_high=date_high
)
geo = gp.read_file(config["geopackage"])
geo = gp.read_file(config["geopackage"], layer="UK2019mod_pop_xgen")
geo = geo.sort_values("lad19cd")
geo = geo[geo["lad19cd"].isin(locations["lad19cd"])]
adjacency = _compute_adjacency_matrix(geo.geometry, geo["lad19cd"], 200)
area = xarray.DataArray(
geo.area,
name="area",
@@ -68,6 +91,7 @@ def gather_data(config):
C=mobility.astype(DTYPE),
W=commute_volume.astype(DTYPE),
N=popsize.astype(DTYPE),
adjacency=adjacency,
weekday=weekday.astype(DTYPE),
area=area.astype(DTYPE),
locations=xarray.DataArray(
@@ -132,6 +156,7 @@ def CovidUK(covariates, initial_state, initial_step, num_steps):
)
def alpha_t():
"""Time-varying force of infection"""
return tfd.MultivariateNormalDiag(
loc=tf.constant(0.0, dtype=DTYPE),
scale_diag=tf.fill(
@@ -139,6 +164,27 @@
),
)
def sigma_space():
"""Variance of CAR prior on space"""
return tfd.HalfNormal(scale=tf.constant(0.1, dtype=DTYPE))
def spatial_effect():
"""Spatially-correlated random effect with a CAR prior"""
W = tf.convert_to_tensor(covariates["adjacency"])
Dw = tf.linalg.diag(tf.reduce_sum(W, axis=-1))  # row sums of adjacency
rho = 0.5  # rho < 1 keeps Dw - rho*W diagonally dominant, hence positive definite
precision = Dw - rho * W
cov = tf.linalg.inv(precision)
scale = tf.linalg.cholesky(cov)
return tfd.MultivariateNormalTriL(
loc=tf.constant(0.0, DTYPE),
scale_tril=scale,
)
# return tfd.MultivariateNormalDiag(
# loc=tf.constant(0.0, dtype=DTYPE),
# scale_diag=tf.ones(covariates["adjacency"].shape[0], dtype=DTYPE)
# * sigma_space,
# )
def gamma0():
return tfd.Normal(
loc=tf.constant(0.0, dtype=DTYPE),
@@ -151,7 +197,16 @@
scale=tf.constant(100.0, dtype=DTYPE),
)
def seir(psi, beta_area, alpha_0, alpha_t, gamma0, gamma1):
def seir(
psi,
beta_area,
alpha_0,
alpha_t,
spatial_effect,
sigma_space,
gamma0,
gamma1,
):
psi = tf.convert_to_tensor(psi, DTYPE)
beta_area = tf.convert_to_tensor(beta_area, DTYPE)
alpha_t = tf.convert_to_tensor(alpha_t, DTYPE)
@@ -199,8 +254,7 @@
),
),
)
eta = alpha_t_ + beta_area * log_area
eta = alpha_t_ + beta_area * log_area + sigma_space * spatial_effect
infec_rate = tf.math.exp(eta) * (
state[..., 2]
+ psi
@@ -236,6 +290,8 @@
beta_area=beta_area,
psi=psi,
alpha_t=alpha_t,
sigma_space=sigma_space,
spatial_effect=spatial_effect,
gamma0=gamma0,
gamma1=gamma1,
seir=seir,
@@ -287,7 +343,11 @@ def next_generation_matrix_fn(covar_data, param):
),
)
eta = alpha_t_ + param["beta_area"] * log_area[:, tf.newaxis]
eta = (
alpha_t_
+ param["beta_area"] * log_area[:, tf.newaxis]
+ param["sigma_space"] * param["spatial_effect"]
)
infec_rate = (
tf.math.exp(eta)
* (
[tool.poetry]
name = "covid19uk"
version = "0.7.5"
version = "0.8.0-alpha.0"
description = "Spatial stochastic SEIR analysis of COVID-19 in the UK"
authors = ["Chris Jewell <c.jewell@lancaster.ac.uk>"]
license = "MIT"