Statistical challenges in utilising Electronic Health Records for medical research

Wednesday 26th September 2018



This meeting will focus on statistical challenges in the use of Electronic Health Records (EHRs) for prognostic modelling and observational epidemiology. It will showcase some of the wide-ranging statistical methods being developed and applied to large-scale, routinely collected clinical data.


Elizabeth Williamson (London School of Hygiene and Tropical Medicine)

John Tazare (London School of Hygiene and Tropical Medicine)

Harvey Goldstein (University of Bristol)

Angela Wood (University of Cambridge)

Peter Diggle (Lancaster University and Health Data Research UK)


Venue:  The Hardy Room, De Morgan House, 57-58 Russell Square, London WC1B 4HS

Date: 1.30-5.00pm, Wednesday 26th September 2018

Registration fees: £25 for full or retired members; £10 for student members; £50 for non-members or £70 for non-members including Biometric Society membership for 2018

Organiser:  Michael Sweeting (contact via the webpage).




13:30 - 14:05 Elizabeth Williamson (LSHTM)

Missing data in electronic health record research: challenges and (some) solutions

Studies conducted using routinely collected data - such as electronic health records (EHR) - often suffer from missing data.  While the field of missing data methodology is comparatively well developed, certain features of EHR data mean that the application of missing data methods requires specific consideration in this setting. One key issue is the sheer size of the available data, with large numbers of patients, but also large amounts of data collected on each patient over many years. A second issue is that data were not collected for the purpose of research, leading to data-dependent sampling: measurements are often only taken when the patient visits their GP, an action which is often prompted by reasons related to the health status of the patient at that time.

This talk will explore problems of missing data in studies using EHR data, paying particular attention to features specific to this setting. A range of possible missing data methods will be discussed, and illustrated on a range of real EHR studies.
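As a toy illustration of the data-dependent sampling problem described above, the following numpy sketch (all variable names and values are hypothetical, not taken from the talk) simulates a blood-pressure measurement that is recorded more often for older patients. The complete-case mean is then biased, while a simple regression imputation using the fully observed covariate approximately recovers the truth:

```python
import numpy as np

# Hypothetical sketch of missing-at-random (MAR) bias in EHR-style data.
rng = np.random.default_rng(0)
n = 100_000
age = rng.normal(60, 10, n)                    # always recorded
sbp = 110 + 0.5 * age + rng.normal(0, 10, n)   # sometimes recorded

# Older patients visit the GP more often, so SBP is recorded more often
# for them: the chance of measurement depends on age (MAR given age).
p_obs = 1 / (1 + np.exp(-(age - 60) / 5))
observed = rng.random(n) < p_obs

cc_mean = sbp[observed].mean()                 # complete-case: biased upward

# Regression imputation using the fully observed covariate (age)
b1, b0 = np.polyfit(age[observed], sbp[observed], 1)
sbp_imp = np.where(observed, sbp, b0 + b1 * age)
imp_mean = sbp_imp.mean()                      # approximately unbiased

true_mean = sbp.mean()
```

In a real EHR analysis one would use multiple imputation rather than a single regression fill-in, so that the extra uncertainty from imputation is propagated into standard errors.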



14:05 - 14:40 John Tazare (LSHTM)

Application of the high-dimensional Propensity Score (hd-PS) to UK EHRs

The hd-PS algorithm aims to account for confounding by adjusting for a large set of proxies thought to be informative of the underlying health status of patients. Evidence from US claims data (where the algorithm was developed) suggests that it outperforms investigator-led propensity score approaches, which has led to its widespread adoption in a variety of contexts. However, the potential disparity between claims databases and UK EHRs means that careful consideration of how source-specific characteristics are handled is necessary, to avoid implementing the algorithm in a manner discordant with hd-PS principles. Using a recent study from the Clinical Practice Research Datalink, we will present results incorporating a series of modifications that aim to tailor hd-PS principles to this setting.
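The covariate-prioritisation step at the heart of hd-PS can be sketched as follows. This is an illustrative, simulated-data reconstruction of the published Bross bias-multiplier ranking, not the speaker's code; all names, dimensions, and prevalences are assumptions:

```python
import numpy as np

# Illustrative sketch of hd-PS covariate prioritisation on simulated data:
# rank candidate proxy codes by the Bross bias multiplier, then keep the
# top k for inclusion in the propensity score model.
rng = np.random.default_rng(1)
n, p, k = 20_000, 200, 50
codes = rng.random((n, p)) < rng.uniform(0.01, 0.3, p)   # binary proxy codes
treat = rng.random(n) < 0.3                              # exposure indicator
outcome = rng.random(n) < 0.05                           # outcome indicator

# Prevalence of each code among exposed (pc1) and unexposed (pc0)
pc1 = codes[treat].mean(axis=0)
pc0 = codes[~treat].mean(axis=0)

# Crude relative risk of the outcome given each code, guarded against 0/0
py1 = (outcome[:, None] & codes).sum(0) / np.maximum(codes.sum(0), 1)
py0 = (outcome[:, None] & ~codes).sum(0) / np.maximum((~codes).sum(0), 1)
rr = np.clip(py1 / np.maximum(py0, 1e-9), 1e-3, None)

# Bross bias multiplier; codes with the largest |log multiplier| are the
# ones whose omission could bias the exposure-outcome estimate most
bias = (pc1 * (rr - 1) + 1) / (pc0 * (rr - 1) + 1)
top_k = np.argsort(-np.abs(np.log(bias)))[:k]
```

The selected `top_k` codes would then enter a conventional propensity score model alongside investigator-specified covariates.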

14:40 - 15:15 Harvey Goldstein (University of Bristol)

Linkage matching errors and quantifying uncertainty for model fitting

The talk will describe how the inherent uncertainty about the correctness of record matches can be quantified and utilised in subsequent statistical modelling. The proposal is to extend a Bayesian missing data algorithm to incorporate data priors estimated from a particular record matching algorithm. Implications of this approach for the development of population record 'spines' will be mentioned.

15:15 - 15:45 Tea/Coffee Break
15:45 - 16:20 Angela Wood (University of Cambridge)




Estimating cardiovascular disease risk in electronic health records with incomplete records and repeated measurements of risk predictors

Stratification of individuals according to their estimated cardiovascular disease (CVD) risk is used to guide clinical decision-making. Current UK guidelines for CVD risk assessment recommend the use of already recorded risk factors in electronic primary care records to prioritise patients for a full formal risk assessment, although there is no guidance on how this should be achieved.

We present a computationally feasible statistical approach to the methodological challenges of utilising historical repeat risk factor measurements recorded in primary care records to systematically identify patients at high risk of future CVD. The approach is based on a dynamic two-stage landmark model. The first stage estimates predicted current risk factor values from all available historical repeat measurements, using landmark-age-specific multivariate linear mixed-effects models with correlated random intercepts, which account for sporadically recorded repeat measures, unobserved data and measurement errors. The second stage predicts future disease risk from a sex-stratified Cox proportional hazards model, using the predicted current risk factor values estimated in the first stage.
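As a caricature of the first (mixed-model) stage, the numpy sketch below shrinks each patient's sporadic repeat measurements towards the population mean via the random-intercept best linear unbiased predictor (BLUP); all numerical values and the single-risk-factor setup are hypothetical simplifications of the multivariate models described:

```python
import numpy as np

# Toy random-intercept BLUP: a patient's "current" SBP level is predicted
# by shrinking their observed mean towards the population mean, with more
# shrinkage when they have few recorded visits. All values illustrative.
rng = np.random.default_rng(2)
mu = 130.0            # population mean SBP
tau2 = 15.0 ** 2      # between-patient variance (random intercept)
sigma2 = 10.0 ** 2    # within-patient measurement-error variance

for _ in range(5):                               # a handful of patients
    b = rng.normal(mu, np.sqrt(tau2))            # patient's true level
    n_i = int(rng.integers(1, 6))                # sporadic visit count
    y = rng.normal(b, np.sqrt(sigma2), n_i)      # repeat measurements
    w = tau2 / (tau2 + sigma2 / n_i)             # shrinkage weight in (0, 1)
    blup = w * y.mean() + (1 - w) * mu           # predicted current level
```

In the second stage, predicted values like `blup` would feed into a Cox proportional hazards model as risk predictors at the landmark age.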

We have developed and internally validated a dynamic 10-year cardiovascular disease risk prediction model using electronic primary care records for age, diabetes status, hypertension treatment, smoking status, systolic blood pressure, and total and high-density lipoprotein cholesterol from ~2 million individuals in ~400 primary care practices in England and Wales contributing to the Clinical Practice Research Datalink. We propose using such models as pre-screening tools to identify individuals in greatest need of a more formal CVD assessment. Using public health modelling, we identify optimal pre-screening risk thresholds for inviting individuals in for a formal risk assessment.


16:20 - 16:55 Peter Diggle (Lancaster University and Health Data Research UK)

Real-time spatial health surveillance using routinely recorded clinical data

Spatially and temporally referenced data on a wide range of health outcomes are routinely recorded within NHS IT systems. These data present obvious opportunities for monitoring geographical variations in incidence, both to establish the expected pattern of geographical variation and to detect departures from expectation in real time.

A natural modelling framework for data of this kind is a spatio-temporal point process. Ideally, this requires data to be available on all incident cases, or at least a completely random subset of them, with finely resolved spatial and temporal coordinates for each case. In practice, spatial resolution is often constrained by concerns about the confidentiality of individual-level information, or data are available only from a “sentinel network”, such as a sample of general practices within the geographical region of interest.
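A standard building block of such models is the inhomogeneous spatial Poisson process, which can be simulated by thinning a homogeneous process at the maximal rate. The sketch below (with an entirely hypothetical intensity surface) illustrates the idea on the unit square:

```python
import numpy as np

# Simulate an inhomogeneous spatial Poisson process by thinning.
# The intensity surface is a made-up example peaking at the centre.
rng = np.random.default_rng(3)
lam_max = 200.0  # upper bound on the intensity over the region

def intensity(x, y):
    return lam_max * np.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / 0.1)

# Step 1: homogeneous Poisson process at the maximal rate lam_max
n = rng.poisson(lam_max)
x, y = rng.random(n), rng.random(n)

# Step 2: retain each candidate point with probability intensity/lam_max
keep = rng.random(n) < intensity(x, y) / lam_max
cases = np.column_stack([x[keep], y[keep]])   # simulated case locations
```

Fitting such models to real surveillance data then amounts to estimating the intensity surface (and its temporal evolution) from observed case locations, subject to the sampling constraints described above.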




