Checking the validity of obtaining additional health measure estimates in English census data


Although population-level administrative datasets are increasingly available to researchers, they do not always include all variables of interest. For example, smoking status is not often recorded in routine data. Surveys of the population, on the other hand, often include detailed questions but typically have samples too small to give full population coverage. A common approach to obtaining estimates for areas and groups with small numbers in surveys is to combine estimates obtained from modelling survey samples with related population-level information from census data. For example, representative surveys of smoking for local areas in the UK are rare, but there is a need to monitor local trends and to assess interventions. Local-level estimates are produced by multilevel modelling of smoking on individual-level and small-area predictors using survey data, and then weighting the modelled estimates to the local profile using census data. This approach can produce local estimates that are representative with respect to the modelled predictors, even when the survey sample itself is not locally representative, and the estimates can then also be used in population-level studies. Yet such a technique yields only aggregate-level estimates, and it may be preferable to have estimates at the individual level, as one would get if, for example, smoking had been included as a census question.
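The weighting step described above can be illustrated with a small sketch. It assumes the modelling stage has already produced a smoking prevalence for each predictor cell (here, age band by sex); the function name, the cells and all the numbers are hypothetical and chosen only to show the arithmetic, not taken from any survey or census.

```python
def poststratify(cell_rates, census_counts):
    """Weight modelled cell-level prevalence by one area's census counts."""
    total = sum(census_counts.values())
    return sum(cell_rates[cell] * n for cell, n in census_counts.items()) / total

# Hypothetical modelled smoking prevalence for each (age band, sex) cell
cell_rates = {
    ("16-34", "F"): 0.20,
    ("16-34", "M"): 0.25,
    ("35+", "F"): 0.15,
    ("35+", "M"): 0.18,
}

# Hypothetical census counts of residents in each cell for one local area
area_counts = {
    ("16-34", "F"): 1000,
    ("16-34", "M"): 1000,
    ("35+", "F"): 2000,
    ("35+", "M"): 2000,
}

local_estimate = poststratify(cell_rates, area_counts)
print(round(local_estimate, 3))
```

The result is a single figure for the area, which is why the approach gives aggregate rather than individual-level estimates.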

We are seeking to ascertain whether multiple imputation, combining census data with a sample survey, can replicate the estimates of self-rated health obtained from the census.

We aim to use multiple imputation to solve what can be configured as a missing data problem. Ultimately we hope to use this approach to generate a smoking variable for the UK census, in which smoking is not asked. We propose to achieve this in three steps: the first two involve validation and, contingent on these, the third involves carrying out the multiple imputation of smoking responses for the UK census. The first step is the focus of this project: using survey data on self-rated health and its predictors, combined with census data on the same predictors, we impute self-rated health for the census records. As the UK census does ask a self-rated health question, this provides an excellent check on our method, as we can assess the validity of the imputed self-rated health against the actual answers.
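To make the missing data framing concrete, the following is a deliberately simplified sketch, not the project's actual imputation model. It assumes a binary outcome (for example, good versus not good self-rated health), a single categorical predictor shared by the survey and the census, and that every census cell also appears in the survey. The function `impute_binary` and the toy data are hypothetical.

```python
import random

def impute_binary(survey, census_cells, m=5, seed=0):
    """Return m imputed outcome lists, one value per census record.

    survey: list of (cell, outcome) pairs with outcome in {0, 1}
    census_cells: list giving the predictor cell of each census record
    """
    rng = random.Random(seed)
    # Tally positive and negative outcomes per predictor cell in the survey
    counts = {}
    for cell, y in survey:
        pos, neg = counts.get(cell, (0, 0))
        counts[cell] = (pos + y, neg + (1 - y))
    imputations = []
    for _ in range(m):
        # Draw each cell's probability from a Beta(1 + pos, 1 + neg) posterior,
        # so uncertainty in the survey estimate carries into the imputations
        p = {c: rng.betavariate(1 + pos, 1 + neg) for c, (pos, neg) in counts.items()}
        imputations.append([1 if rng.random() < p[c] else 0 for c in census_cells])
    return imputations

# Toy data (made up): the outcome is observed in the survey, "missing" in the census
survey = ([("young", 1)] * 90 + [("young", 0)] * 10
          + [("old", 1)] * 60 + [("old", 0)] * 40)
census_cells = ["young"] * 500 + ["old"] * 500

imputed = impute_binary(survey, census_cells, m=5, seed=1)
# Pool the prevalence across the m imputed datasets
pooled = sum(sum(d) / len(d) for d in imputed) / len(imputed)
print(round(pooled, 3))
```

In the validation step, a pooled figure of this kind would be set against the prevalence calculated from the self-rated health answers actually recorded in the census.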

We are using data for England from the Integrated Household Survey, in which self-rated general health is asked using questions similar to those in the UK 2011 census. Our census dataset is the 2011 Sample of Anonymised Records (SARs), a sample of the census with just under 3 million cases. It contains a larger number of individual variables than standard census tables (which are limited to three- or four-way comparisons), and its geographic coverage is broader (local authority/regional). We are checking the effectiveness of our imputation by comparing regional and local authority prevalence estimates by socioeconomic variables, using the original and the imputed self-rated health answers in the census, and assessing variation by age and sex. We are also comparing the results of models of inequalities in self-rated health.
