Ranking spatial areas by risk of cancer: modelling in epidemiological surveillance

Pablo Fernández-Navarro; Javier González-Palacios; Mario González-Sánchez; Rebeca Ramis; Olivier Nuñez; Francisco Palmí-Perales; Virgilio Gómez-Rubio

doi:10.21037/ace-20-15

Original Article

Pablo Fernández-Navarro¹,²,³, Javier González-Palacios¹,³, Mario González-Sánchez¹,³, Rebeca Ramis¹,², Olivier Nuñez¹,², Francisco Palmí-Perales⁴, Virgilio Gómez-Rubio⁴

¹Cancer and Environmental Epidemiology Unit, Department of Epidemiology of Chronic Diseases, National Center for Epidemiology, Carlos III Institute of Health, Madrid, Spain; ²Consortium for Biomedical Research in Epidemiology & Public Health (CIBER en Epidemiología y Salud Pública-CIBERESP), Madrid, Spain; ³Bioinformatics and Data Management Group (BIODAMA), Department of Epidemiology of Chronic Diseases, National Center for Epidemiology, Carlos III Institute of Health, Madrid, Spain; ⁴Department of Mathematics, School of Industrial Engineering-Albacete, Universidad de Castilla-La Mancha, Albacete, Spain

Contributions: (I) Conception and design: P Fernández-Navarro, V Gómez-Rubio; (II) Administrative support: None; (III) Provision of study material or patients: P Fernández-Navarro; (IV) Collection and assembly of data: P Fernández-Navarro, J González-Palacios, M González-Sánchez; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Pablo Fernández-Navarro. Cancer and Environmental Epidemiology Unit, Department of Epidemiology of Chronic Diseases, National Center for Epidemiology (Pab. 12), Carlos III Institute of Health, Avda. Monforte de Lemos, 5, 28029 Madrid, Spain. Email: pfernandezn@isciii.es.

Background: The representation and analysis of maps of events within a fixed time frame/period has been established as a basic tool for disease monitoring. In addition to having methods that can address the study of certain problems, the existence of criteria to discriminate relevant results is equally important. In chronic diseases such as cancer, monitoring the spatial distribution of mortality/morbidity in small areas through relative risk (RR) estimators is used frequently, but there is no clear strategy to discriminate which regions are important. Moreover, it usually requires substantial time for an effective surveillance or an advanced technical knowledge. The objectives of this study are to first establish a data analysis pipeline that allows users to make an initial screening for exploratory purposes so they can identify regions of interest in the context of chronic diseases monitoring and second to develop an R-Shiny application to implement this strategy in a straightforward way without requiring strong technical knowledge.

Methods: First, a pipeline of six steps for ranking spatial areas by risk of disease was developed, taking into account relative and absolute risk estimators and using observed and expected cases in the spatial units of a study region. Second, an R-Shiny application (RANKSPA, Ranking Spatial Areas) was developed to perform the pipeline. Third, we applied the pipeline using RANKSPA to simulated data and to real data of municipal lung cancer mortality for 2005–2009 in Galicia (North-West of Spain), a region with 314 spatial units.

Results: Using simulated data, there was a clear excess of mortality in the middle-west of the studied region, where a spatial mortality cluster is also located; five spatial units outside this cluster occupy the top positions in the ranking generated by the application. Of the 314 spatial units of the study region, only 14 had an excess of mortality with a posterior probability greater than or equal to 80%. In addition, all the spatial tests implemented, with the exception of Moran’s I test, were statistically significant. In the study of real data, a clear excess of mortality was observed in the west of the study region, where several of the spatial mortality clusters are also located. Moreover, three spatial units located outside these clusters occupy the top positions in the ranking generated by the application. Eleven spatial units had an excess of mortality with a posterior probability (PP) greater than or equal to 80%. All the spatial tests implemented were statistically significant. In both simulated and real data, there was a positive correlation between absolute and relative measures. However, a greater dispersion was observed when these measures take the highest values.

Conclusions: The work presented shows a strategy of exploratory analyses to provide an initial assessment of geographical patterns in disease risk, focused primarily on chronic diseases such as cancer. Furthermore, an R-Shiny application has been created to ease the implementation of this strategy without requiring substantial technical knowledge.

Keywords: Spatial analysis; ranking; epidemiological surveillance; ranking application; screening; cancer

Received: 17 February 2020; Accepted: 27 October 2020; Published: 30 November 2020.

Introduction

Public health surveillance includes a wide range of monitoring methods, and the spatial or spatio-temporal study of diseases is a reasonable starting point for any grounded health intervention. Likewise, strategies that employ or integrate different statistical methods could be relevant to best aid this surveillance.

Spatial surveillance merges statistics that monitor for evidence of a change with spatial techniques, which are often used to find or describe the extent of clustering across a map. Another idea related to spatial surveillance is that of screening, which can be applied to populations as well as individuals (1), where the location of the public health event is as important as the fact that it occurred. Surveillance and screening carried out at an aggregate population level (e.g., municipality, census tract, etc.) could be useful to detect exceeded limits based on observed or expected patterns, to know where health incidents occur, and possibly to anticipate where they will occur. All of these might trigger interventions designed, for example, to redirect health resources towards improving the health status of the population, and they can also be useful for prevention programs.

The representation and analysis of maps of events within a fixed time frame/period is established as a basic tool for disease monitoring. In this context, a wide range of methods can be applied, such as disease mapping, spatial clustering or ecological analysis. Each of them is appropriate to answer a specific question, and usually requires advanced technical knowledge (programming, statistics, etc.) to be implemented. In this regard, there are valuable applications or statistical packages, such as the “GeoDa” software (2), “SSTCDapp” (3) and the R-Shiny web application “SpatialEpiApp” (4), that allow us to perform these methods in a simple way and without extensive technical knowledge.

Furthermore, in spatial surveillance, apart from having specific methods, it is equally important to have a criterion to discriminate relevant results. For example, in disease mapping, which usually involves Bayesian models to smooth the underlying risk estimates, the posterior probability (PP) that the relative risk (RR) is higher than 1 is used as a Bayesian method to assess high risk. As far as this indicator is concerned, it is usual to follow Richardson’s criterion (5), which recommends that probabilities above 0.8 should be deemed significant. Although this approach can improve the discrimination of the areas of interest based on the RR, in the context of public health and surveillance the methods should also consider absolute measures related to, for example, the magnitude of the expected counts, or other approaches to rank risk estimates with PP higher than 80%.

In chronic diseases, such as cancer, the spatial distribution of mortality or morbidity in small areas is very frequently monitored through RR estimators (6-11). In this sense, cancer mortality or incidence atlases (12-15) are very useful tools. Nevertheless, on many occasions, the time required to conduct these analyses or produce these atlases is too long for effective surveillance or to develop interventions and address problems in time. In addition, there is no clear strategy to discriminate which regions are important from the point of view of both the relative estimators and the absolute estimators that matter for public health.

Accordingly, the objectives of this work are: (I) to establish a pipeline of analysis that discriminates regions of interest in the context of monitoring chronic diseases (such as cancer), by studying the spatial distribution of mortality/morbidity in small areas through relative and absolute risk estimators and their precision, and thereby to obtain a ranking of spatial units that serves as an initial spatial screening for exploratory purposes in epidemiological surveillance and public health; and (II) to present the R-Shiny application that has been created to implement this strategy simply and quickly, without requiring deep technical knowledge of the underlying statistical and programming methods.

Methods

In order to achieve these objectives, several spatial analyses with the calculation of relative and absolute estimators were conducted to identify and to rank areas of risk of a disease. The study area is divided into several non-overlapping smaller regions and the analyses can start once a measure of observed and expected cases and the population is available for each of the study regions.

As a result, a pipeline of analysis including the strategies followed to rank risk areas has been established as indicated in Figure 1 and described below. Finally, we show the Shiny web application for ranking spatial areas of risk in spatial surveillance “RANKSPA” to perform the pipeline and what results are obtained by applying this to simulated data as an example and to a real dataset of municipal lung cancer mortality.

Figure 1 Steps of the pipeline for the ranking of spatial areas.

Pipeline for ranking spatial areas

The analyses presented in this work deal with modelling areal data (observed cases on areal entities with defined boundaries). In these analyses, the spatial autocorrelation among areal entities should be taken into account. Accordingly, the first step in the pipeline is to create the neighbourhood structure using a standard neighbour criterion, contiguity, where all touching polygons are neighbours. For this purpose, the “poly2nb” function of the “spdep” R package could be applied (16,17).
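To illustrate the contiguity criterion, the following minimal Python sketch builds a neighbour list for a toy map. It is not the “poly2nb” implementation: as a simplifying assumption, two areas are treated as neighbours when their boundaries share at least one vertex, and the function name `contiguity_neighbours` and the toy coordinates are hypothetical.

```python
from itertools import combinations


def contiguity_neighbours(polygons):
    """Return {area_id: sorted list of neighbouring area ids}.

    Simplification (assumption): two areas are neighbours when their
    boundaries share at least one vertex. Real polygon data would need a
    proper geometric "touches" test.
    """
    vertex_sets = {area: set(coords) for area, coords in polygons.items()}
    neighbours = {area: set() for area in polygons}
    for a, b in combinations(polygons, 2):
        if vertex_sets[a] & vertex_sets[b]:  # at least one shared vertex
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {area: sorted(ids) for area, ids in neighbours.items()}


# Hypothetical toy map: three unit squares in a row plus one isolated square.
toy_map = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(2, 0), (3, 0), (3, 1), (2, 1)],
    "D": [(5, 5), (6, 5), (6, 6), (5, 6)],
}
nb = contiguity_neighbours(toy_map)
print(nb)
```

Areas without neighbours (such as "D" here) deserve attention in practice, since the spatial random effect of the model used later in the pipeline relies on the neighbourhood structure.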

As part of any spatial data analysis, a series of statistical tests should be applied to assess overdispersion, spatial autocorrelation and general spatial clustering. These analyses constitute the second step, and their results will enrich the ranking process that will be discussed later. First of all, the spatial heterogeneity of the risk measures is assessed, testing for global significant differences between observed and expected cases. Possible heterogeneity may be related to many different factors, such as the presence of different medical assistance systems or pollution sources in the areas. Two common tests of overdispersion are used in this step: the standard Chi-square test and Potthoff-Whittinghill’s test [see the parametrization described in pages 348–349 of the Bivand et al. book (17), which consists of the use of a multinomial model in the bootstrap test and 999 replicates to compute the significance of the observed value of the test statistic]. The “achisq.test” and “pottwhitt.test” functions in the “DCluster” R package (18) could be used to perform them. Moreover, two common tests of global clustering, Moran’s I test (19) and Tango’s maximised excess events test (20,21), are also performed in this step to assess global spatial autocorrelation. The “moranI.test” and “tango.test” functions in the “DCluster” R package (18) could be applied to perform these tests [see the parametrization described in pages 350–352 of the Bivand et al. book (17), which consists of the use of a negative binomial model in the bootstrap test and 999 replicates to compute the significance of the observed value of the test statistic].
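The logic of a bootstrap heterogeneity test can be sketched in a few lines of Python. This is only an illustration of the idea (a Pearson Chi-square statistic whose significance is assessed by resampling counts from a multinomial model with 999 replicates), not the “DCluster” code; the function names and the toy counts are hypothetical.

```python
import random
from collections import Counter


def chisq_stat(obs, exp):
    """Pearson statistic sum((O - E*)^2 / E*), where E* is the expected
    count rescaled so that sum(E*) equals sum(O)."""
    scale = sum(obs) / sum(exp)
    return sum((o - scale * e) ** 2 / (scale * e) for o, e in zip(obs, exp))


def multinomial_bootstrap_pvalue(obs, exp, replicates=999, seed=1):
    """Return (observed statistic, bootstrap p-value).

    Each replicate redistributes the total number of cases among areas
    with probabilities proportional to the expected counts.
    """
    rng = random.Random(seed)
    total = sum(obs)
    observed_stat = chisq_stat(obs, exp)
    areas = range(len(obs))
    at_least_as_extreme = 0
    for _ in range(replicates):
        draws = Counter(rng.choices(areas, weights=exp, k=total))
        simulated = [draws.get(i, 0) for i in areas]
        if chisq_stat(simulated, exp) >= observed_stat:
            at_least_as_extreme += 1
    # The observed value counts as one of the replicates.
    return observed_stat, (at_least_as_extreme + 1) / (replicates + 1)


# Hypothetical toy data: the first area has a marked excess of cases.
obs = [30, 5, 8, 7]
exp = [12, 13, 12, 13]
statistic, p_value = multinomial_bootstrap_pvalue(obs, exp)
print(round(statistic, 2), p_value)
```

With 999 replicates the smallest attainable p-value is 1/1000, which is why the number of replicates bounds the resolution of the test.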

In the third step of the proposed pipeline, a standard test [Kulldorff and Nagarwalla test (22)] for the detection of spatial clusters based on a circular window is applied. We selected this approach because it is a well-established method in this field, easy to implement and with a relatively low computational cost, although it has several limitations as will be discussed later. The results from this analysis will enrich, with relevant information for spatial surveillance, the ranking of the risk areas that is the final result of the application of this pipeline. The “opgam” function in the “DCluster” R package could be applied to perform this test [see the parametrization described in pages 353–354 of the Bivand et al. book (17), which consists of the use of a negative binomial model in the bootstrap test, 99 replicates to compute the significance of the observed value of the test statistic, a significance level of 5% for the tests performed and a 15% fraction of the total population for the circles creation in the method]. To define the window in this detection of spatial clusters, the maximum fraction of the total population inside the cluster could be used. This parameter should be varied to assess whether the location of clusters changes significantly. Moreover, only clusters with P values lower than 0.05 will be taken into account.
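The role of the circular window and of the population fraction can be sketched as follows. This Python sketch scans windows grown around each area centroid and keeps the one with the largest Poisson log-likelihood ratio; it omits the bootstrap significance step entirely, and the data layout (dictionaries with `id`, `x`, `y`, `pop`, `obs`, `exp`) and the toy values are assumptions made for illustration, not the “opgam” interface.

```python
import math


def kulldorff_llr(obs_in, exp_in, obs_total):
    """Poisson log-likelihood ratio for one candidate window.

    Assumes expected counts are scaled to sum to obs_total. Returns 0
    when the window shows no excess risk relative to the outside.
    """
    obs_out = obs_total - obs_in
    exp_out = obs_total - exp_in
    if exp_in <= 0 or exp_out <= 0 or obs_in / exp_in <= obs_out / exp_out:
        return 0.0
    llr = obs_in * math.log(obs_in / exp_in)
    if obs_out > 0:
        llr += obs_out * math.log(obs_out / exp_out)
    return llr


def best_window(areas, max_pop_fraction=0.15):
    """Return (statistic, centre id, member ids) of the best window."""
    total_obs = sum(a["obs"] for a in areas)
    total_pop = sum(a["pop"] for a in areas)
    scale = total_obs / sum(a["exp"] for a in areas)
    best = (0.0, None, [])
    for centre in areas:
        by_distance = sorted(
            areas,
            key=lambda a: math.hypot(a["x"] - centre["x"], a["y"] - centre["y"]),
        )
        pop = obs = exp = 0.0
        members = []
        for area in by_distance:
            # Stop growing the window once the population cap is exceeded.
            if pop + area["pop"] > max_pop_fraction * total_pop:
                break
            pop += area["pop"]
            obs += area["obs"]
            exp += scale * area["exp"]
            members.append(area["id"])
            llr = kulldorff_llr(obs, exp, total_obs)
            if llr > best[0]:
                best = (llr, centre["id"], list(members))
    return best


# Hypothetical toy data: ten areas on a line, with an excess in areas 4 and 5.
counts = [10, 10, 10, 10, 25, 25, 10, 10, 10, 10]
toy_areas = [
    {"id": i, "x": float(i), "y": 0.0, "pop": 100, "obs": o, "exp": 10.0}
    for i, o in enumerate(counts)
]
statistic, centre, members = best_window(toy_areas, max_pop_fraction=0.25)
print(round(statistic, 2), centre, members)
```

Changing `max_pop_fraction` changes which windows are admissible, which is why the text recommends varying this parameter to check the stability of the detected clusters.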

The fourth step consists of an initial calculation of local RR estimators of disease and the PP that RR >1 (a Bayesian way to assess high risks). To do this, based on the data to be analysed, the conditional autoregressive model proposed by Besag, York and Mollié (BYM) (8,23,24) is used. This model is based on fitting a Poisson spatial model with observed cases as the dependent variable, log-expected cases as offset, and two types of random effect terms which take the following into account: (I) contiguity (spatial autocorrelation term); and (II) non-spatial heterogeneity. For more details see the article of López-Abente et al. (8). As a tool for Bayesian inference, we recommend the use of Integrated nested Laplace approximations (INLA) (25). The reason for this approach is that it allows reliable results to be obtained in a reasonable time and at much lower computational cost, when compared to the more traditional Markov Chain Monte Carlo (MCMC) method (26-28). The “inla” function in the R-INLA package (29,30) (with the option of simplified Laplace estimation of the parameters) could be used to fit the models. Moreover, the default assumptions for priors of the functions included in the package are used. Finally, in this fourth step, with the RR and PP estimators for each spatial area, we calculate a weighted RR (WRR) by multiplying each RR by its associated PP. In this way, an RR measure penalized by the precision of the estimation is obtained, which could help to discard some non-significant results.
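The effect of the weighting can be seen with a small Python example. The RR and PP values below are hypothetical and would in practice come from the fitted BYM model; the function name `weighted_rr` is ours.

```python
def weighted_rr(rr, pp):
    """Weighted relative risk: the RR multiplied by its posterior
    probability of exceeding 1."""
    return rr * pp


# Hypothetical estimates for two areas.
precise_area = weighted_rr(rr=1.50, pp=0.95)    # moderate RR, high PP
imprecise_area = weighted_rr(rr=1.80, pp=0.55)  # higher RR, low PP
print(round(precise_area, 3), round(imprecise_area, 3))
```

Although the second area has the larger RR, its low PP pulls its WRR below that of the first area, which is the penalization by precision described above.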

The fifth step, the last one before ranking the spatial areas of risk, consists of the calculation of an absolute measure of risk, defined as the difference between observed and expected (DOE) cases. This indicator is related to the standardized mortality ratio (SMR) or the standardized incidence ratio (SIR), which is the ratio between observed and expected cases, but the DOE could be more directly useful for health management because it is expressed as a number of cases.
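A short Python example with hypothetical counts shows why the two measures can order areas differently; the function names are ours.

```python
def smr(obs, exp):
    """Relative measure: ratio of observed to expected cases."""
    return obs / exp


def doe(obs, exp):
    """Absolute measure: difference between observed and expected cases."""
    return obs - exp


# Hypothetical areas: a small one and a large one.
small = {"obs": 6, "exp": 3}
large = {"obs": 150, "exp": 100}
for name, area in (("small", small), ("large", large)):
    print(name, smr(**area), doe(**area))
```

The small area has the higher SMR (2.0 versus 1.5), but the large area accounts for far more excess cases (50 versus 3), which is the information a health manager would use to size an intervention.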

Finally, the last step of the pipeline, the sixth step, is the identification of those areas whose PP is greater than or equal to 80% [Richardson’s criterion (5)]. These areas are ranked first according to the DOE and second according to the WRR. Once the areas have been ranked, they are assigned a numerical value, starting from 1, which marks the highest level of importance, that is, the highest DOE with the highest WRR. All the spatial areas whose PP is lower than 80% receive the same rank value, which is the highest numerical value assigned to the spatial areas of the first group plus one.
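The ranking rule can be written compactly. The following Python sketch is one possible reading of the rule, under two assumptions that the text does not settle: the threshold is applied as PP ≥ 0.8, and areas tied on both DOE and WRR receive consecutive ranks. The data layout and the values are hypothetical.

```python
def rank_areas(areas, pp_threshold=0.8):
    """Return {area id: rank}.

    Areas with PP >= pp_threshold are sorted by DOE (descending) and then
    by WRR (descending) and numbered from 1. All remaining areas share
    the next rank value.
    """
    selected = [a for a in areas if a["pp"] >= pp_threshold]
    selected.sort(key=lambda a: (-a["doe"], -a["wrr"]))
    ranks = {a["id"]: pos for pos, a in enumerate(selected, start=1)}
    shared_rank = len(selected) + 1
    for a in areas:
        ranks.setdefault(a["id"], shared_rank)
    return ranks


# Hypothetical estimates for five areas.
toy = [
    {"id": "a1", "doe": 12.0, "wrr": 1.4, "pp": 0.95},
    {"id": "a2", "doe": 12.0, "wrr": 1.6, "pp": 0.90},
    {"id": "a3", "doe": 30.0, "wrr": 1.2, "pp": 0.85},
    {"id": "a4", "doe": 40.0, "wrr": 0.9, "pp": 0.60},
    {"id": "a5", "doe": -3.0, "wrr": 0.2, "pp": 0.10},
]
ranks = rank_areas(toy)
print(ranks)
```

Note that "a4" has the largest DOE but is not ranked individually because its PP is below the threshold, and that "a1" and "a2", which tie on DOE, are separated by their WRR.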

The final result of the pipeline is a table of ranked spatial areas attending to absolute and RR estimators and enriched with useful information of global and local clustering and spatial heterogeneity indicators to understand the ranking. This constitutes a rapid initial screening to identify areas of interest for public health.

An R script (“R_script.R”) is provided to perform the pipeline. Simulated data (“data_example.xlsx”) of observed and expected cases in 314 spatial areas and a shapefile (“map_example.zip”) are also provided for the application of the script. See “Code and data availability” section of the manuscript.

RANKSPA app

In the same way that other authors have provided useful applications for spatial analysis or data management (3,4), which allow users to apply statistical methods in a simple way and without extensive technical knowledge, we have created an R-Shiny web application called RANKSPA (“RANKing SPatial Areas”), which allows users to perform the pipeline described above. Thus, RANKSPA allows users to obtain a ranking of spatial areas according to absolute and relative disease risk. The RANKSPA R-Shiny code is provided (see “Code and data availability” section of this manuscript) and the application is also available online, ready to be used, at https://biodama.shinyapps.io/rankspa/.

RANKSPA consists of one page where: (I) on the left side, the user can upload the input files and select one of the parameters that is needed for the cluster analysis (“fraction of the total population”, see “Pipeline for ranking spatial areas” section of the manuscript); (II) on the right side of the page, there are nine tabs, where an application overview and the results of the different statistical analyses carried out and the maps created by the application can be visualized and downloaded. Figure 2 shows the RANKSPA application, once the simulated data (see “Code and data availability” section of the manuscript) has been processed and specifically showing the tab corresponding to the map of RRs.

Figure 2 RANKSPA application overview.

Input data

First, we upload the data file by clicking the ‘Data input’ button and selecting the corresponding file. The file must be an Excel (XLSX format) or CSV file with the following columns:

ID: identification code of the spatial areas;
Population: total population in each spatial area;
Obs: observed cases in each spatial area;
Exp: expected cases in each spatial area.

Second, we upload the map file containing the areas of the region of study. The map file needs to be in shapefile format (files with extension .shp, .prj, .dbf, .shx). The ID area should be a unique identifier and should match with the ID area that is specified in the data. Map files can be uploaded by clicking the “Shp Input” button and selecting all the corresponding files.
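Before uploading, it may help to check that the data file has the required columns and that its IDs match those of the map. The following Python sketch shows one way to do this for a CSV file; it is not part of RANKSPA, and the function names and the sample values are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("ID", "Population", "Obs", "Exp")


def read_area_table(text):
    """Parse CSV text and check the required columns and unique IDs."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = [
        {
            "ID": row["ID"].strip(),
            "Population": float(row["Population"]),
            "Obs": float(row["Obs"]),
            "Exp": float(row["Exp"]),
        }
        for row in reader
    ]
    ids = [r["ID"] for r in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("area IDs must be unique")
    return rows


def unmatched_ids(rows, map_ids):
    """Return (IDs only in the data, IDs only in the map)."""
    data_ids = {r["ID"] for r in rows}
    map_ids = set(map_ids)
    return sorted(data_ids - map_ids), sorted(map_ids - data_ids)


# Hypothetical data file content and map identifiers.
sample = "ID,Population,Obs,Exp\n1,1200,5,3.2\n2,800,1,2.1\n3,4300,12,10.4\n"
rows = read_area_table(sample)
only_in_data, only_in_map = unmatched_ids(rows, map_ids=["1", "2", "4"])
print(only_in_data, only_in_map)
```

Any ID reported by `unmatched_ids` points to an area that could not be linked between the data file and the shapefile.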

After uploading the map and the data files we can specify one option related to the cluster analysis of the pipeline, the maximum fraction of the total population inside the cluster. The default value is 15%.

Start analysis

After uploading all the files and selecting the option for the cluster analysis, we click the “Process” button. When we do this, the application starts to perform the analysis of the pipeline.

Introduction tab

An application overview is shown in this tab, including a “note” indicating the precaution to be taken in the interpretation and use of the results obtained from this application.

Results tabs

The results of the analyses are shown in seven tabs. In the first one, “ranking map”, a map of the areas in the region of study is shown, which displays in colour only those areas that are relevant according to the ranking criteria of the pipeline described before (areas with PP higher than or equal to 80%, ranked first by the DOE and second by the WRR). Moreover, the colour scale used in the map (see the legend) corresponds to the sextiles of the absolute measures of risk (DOE) of the spatial areas. The numbers shown inside the spatial units correspond to the identification code ID (see description of the variables included in the ranking table tab). The application allows users to zoom in on the map and also download it in pdf format.

In the second tab, “ranking table”, a table is shown containing the relative and absolute risk estimates for the coloured areas in the map of the previous tab as well as information about whether these areas are located within a spatial cluster (the application allows users to download the table in XLSX format). Specifically, the estimators and variables shown in the table are:

Ranking: ranking position (number 1 marks the highest level of importance, that is, the highest score according to two criteria based on the excess of cases/deaths and a WRR, see below);
ID: identification code of the spatial areas;
Population: total population in each spatial area;
Obs: observed cases in each spatial area;
Exp: expected cases in each spatial area;
SMR: standardized mortality/morbidity ratio (observed/expected);
Diff_obs_exp: difference between observed and expected (DOE) cases;
RR: relative risk;
lCre: lower limit of the RR credible interval;
uCre: upper limit of the RR credible interval;
PP: posterior probability (PP) that RR>1;
cluster_1 to cluster_n: cluster membership (labels: cluster = belongs to a cluster; center = belongs to a cluster and is the central area).

Only the clusters that are significant by Kulldorff’s test (P value ≤0.05) are shown, and the order of appearance in the table is established according to Kulldorff’s statistic, so that the cluster that presents the highest value is the first (cluster_1). There are some aspects related to multiple comparisons that are not taken into account in this section, see (31). The table is sorted based on the values of the ranking variable, but users can sort this table by any of the other variables such as, for example, RRs.

In the third tab, “spatial tests”, the results from the spatial tests described in the pipeline section of the manuscript are shown. The application displays exactly the output given by R software.

In the fourth and fifth tabs, “RR map” and “PP map” respectively, RRs (smoothed SMRs) and the distributions of PPs of having an RR >1 are mapped.

In the sixth tab, “cluster analysis”, the results from the cluster analysis are shown according to the criteria described in the pipeline section of the manuscript. On the left side of the tab, a table with the significant clusters detected using Kulldorff’s test (P value ≤0.05) is shown, and the order of appearance in the table is established according to Kulldorff’s statistic, so that the cluster that presents the highest value appears first (cluster_1). Specifically, the estimators and variables shown in the table are:

Cluster: identification (as a numerical code) for the cluster;
Size: number of spatial units included in the cluster;
Statistic: value of the Kulldorff’s statistic;
P value: P value from Kulldorff’s test.

Two maps are also shown in this tab, displaying in colour those spatial units belonging to clusters 1 and 2, which are described in the table on the left side of the tab. The spatial units which correspond to the centers of the clusters shown are coloured in a darker grey than the rest.

In the seventh tab, “complete results table”, a table is shown containing the relative and absolute risk estimates for all areas in the map as well as information about whether these areas are located within a spatial cluster (the application allows users to download the table in XLSX format). The estimators and variables shown are the same as those in the table in the second tab.

Finally, all the maps support zooming. To zoom in, users must select an area by holding the left mouse button and then double-click with the left button on the selected area. To zoom out, users should double-click with the left button on the map.

Glossary tab

A description of the content of each of the application tabs is shown here.

Example with simulated data

The simulated data (“data_example.xlsx” and “map_example.zip”) mentioned before consist of observed and expected cases in 314 spatial areas and a shapefile (see “Code and data availability” section of the manuscript). These data have been analyzed with the RANKSPA application using the default value for the fraction of the total population parameter (15%). The results obtained are shown in the “Results” section of the manuscript.

Example with real data

We have tested the pipeline on real data using the RANKSPA application. The data consisted of individual death entries (observed cases) corresponding to lung cancer (International Classification of Diseases, 10th Revision: C33–C34) in men in Galicia [North-West of Spain, a region where there is a clear spatial pattern of lung cancer mortality (8)] between 2005 and 2009, broken down by municipality (n=314). Population data (men) by town and age (18 age groups) in 2007, the midpoint of the study period, were obtained from the Spanish municipal Register. The National Statistics Institute [Instituto Nacional de Estadística (INE)] of Spain provided deaths and population data. The expected cases were calculated by multiplying the overall Spanish age-specific mortality rates (using lung cancer death entries between 2005 and 2009 in men for the whole country of Spain) for the 5-year study period by each town’s person-years (2007 population multiplied by 5). Afterwards, SMRs were computed as the ratios of the observed to the expected deaths.
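The calculation of expected cases by indirect standardization can be sketched in Python as follows. The example uses three age groups instead of the 18 used in the study, assumes that the reference rates are annual rates per person, and uses entirely hypothetical populations and rates.

```python
def expected_cases(population_by_age, reference_rates, years=5):
    """Expected cases for one town: sum over age groups of
    person-years (population x years) times the reference rate."""
    return sum(
        pop * years * rate
        for pop, rate in zip(population_by_age, reference_rates)
    )


# Hypothetical town with three age groups (the study used 18).
town_population = [1000, 800, 500]      # mid-period population by age group
national_rates = [0.0001, 0.001, 0.004]  # annual deaths per person (assumed)

expected = expected_cases(town_population, national_rates)
observed = 20
print(round(expected, 2), round(observed / expected, 3))
```

Using the mid-period population multiplied by the number of years is an approximation to the true person-years, which is reasonable when the population changes little over the study period.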

We selected lung cancer due to the high mortality and incidence of this chronic disease, which makes it possible to obtain reliable indicators. Lung cancer remains the leading cause of cancer incidence and mortality, with 2.1 million new lung cancer cases (11.6% of the total cases) and 1.8 million deaths predicted in 2018 (18.4% of the total cancer deaths), representing close to 1 in 5 cancer deaths (32). Moreover, lung cancer survival is poor, with a relative survival in Europe of 39% and 13% at 1 and 5 years since diagnosis, respectively (33). Therefore, lung cancer mortality is a good indicator for monitoring this disease.

The results obtained by applying the RANKSPA application to these data, using the default value for the fraction of the total population parameter, are shown in the “Results” section of the manuscript.

The study was conducted in accordance with the Declaration of Helsinki, and ethical approval was not required due to the type of study and design carried out.

Results

Simulated case study

The results displayed in the tabs of the RANKSPA application using the simulated data provided (see “Code and data availability” section of the manuscript) are shown in Figure 3. The tables shown in the “ranking table” and “complete results table” tabs are included in Tables S1 and S2.

Figure 3 Results displayed in the tabs of RANKSPA application using simulated data. (A) Ranking map tab; (B) ranking table tab; (C) RR map tab; (D) PP map tab; (E) cluster analysis tab. RR, relative risk; PP, posterior probability.

According to these results, there is a clear excess of mortality in the middle-west of the studied area (see Figure 3C,D) where a spatial mortality cluster is also located (see Figure 3E). Moreover, there are five spatial units (IDs: 65, 293, 152, 305 and 156) located outside this cluster that occupy the top positions in the ranking generated by the application (see Figure 3A,B).

According to the pipeline that the RANKSPA application implements, of the 314 spatial units of the study region, only 14 have excess mortality with a PP greater than or equal to 80%. All spatial tests implemented, with the exception of Moran’s I test of spatial autocorrelation, provided statistically significant results (P<0.05).

Finally, in Figure 4 a scatterplot is shown displaying the relationship between DOE and WRR for the simulated data in each spatial unit, and indicating whether the PPs were above or below 80%, which was one of the criteria to be taken into account in the pipeline described above. It can be seen that there is in general a positive correlation between absolute (DOE) and relative (WRR) measures. However, a greater dispersion is observed when these measures take the highest values.

Figure 4 Scatterplots displaying the relationship between WRR and the DOE cases in simulated and real data, indicating when the PPs were above or below 80%. WRR, weighted relative risk; DOE, difference between observed and expected; PP, posterior probability.

Real case study

The results displayed in the tabs of RANKSPA application using real data are shown in Figure 5. According to them, there is a clear excess of mortality in the west of the region studied (see Figure 5C,D) where several of the spatial mortality clusters are also located (see Figure 5E). There are three spatial units (IDs: 279, 53 and 33) located outside these clusters that occupy the top positions in the ranking generated by the application (see Figure 5A,B).

Figure 5 Results displayed in the tabs of RANKSPA application using real data. (A) Ranking map tab; (B) ranking table tab; (C) RR map tab; (D) PP map tab; (E) cluster analysis tab. RR, relative risk; PP, posterior probability.

Of the total spatial units of the study region [314], only 11 have excess mortality whose PP are greater than or equal to 80%. All spatial tests implemented provided statistically significant results (P<0.05).

Finally, in Figure 4 a scatterplot is shown displaying the relationship between DOE and WRR also for real data in each spatial unit, and indicating whether the PP are above or below 80%, which was one of the criteria to be taken into account in the pipeline described above. It can be seen, as it was the case with the simulated data, that there is in general a positive correlation between absolute (DOE) and relative (WRR) measures. However, there is some dispersion when these measurements take the highest values.

Discussion

In this paper, we present a pipeline of analysis that could allow us to discriminate regions of interest in the context of monitoring chronic diseases (such as cancer) by studying the spatial distribution of mortality/morbidity in small areas. Taking relative and absolute risk estimators and their uncertainty into account, it produces a ranking of spatial units, which makes it possible to implement an initial spatial screening for exploratory purposes in epidemiological surveillance and public health. Moreover, we introduce the RANKSPA app, an R-Shiny web application that implements this strategy simply and quickly, without requiring deep technical knowledge of the underlying statistical and programming methods. It also serves as an exploratory tool for spatial analysis since it enables visualization of maps and the detection of clusters.

As already indicated in the Introduction section, in the context of public health and surveillance of chronic diseases such as cancer, the representation and analysis of maps of events (disease mapping) within a fixed time frame/period is a basic tool, and the existence of criteria to discriminate relevant results is important. Furthermore, in many cases, this process should take into account not only relative measures but also absolute measures of risk. The pipeline shown tries to incorporate and integrate different aspects already known in the field of spatial surveillance, combining these risk measures.

Here we have focused on the DOE cases and the WRR to rank areas. However, this is a first approach: the DOE will likely depend on the population of the areas, so other relative measures that take the size of the areas into account (such as Chi-square distances between observed and expected cases), together with their associated uncertainty, could be considered when ranking the areas. Our preliminary assessment of such measures indicates that the resulting ranking is very similar to that provided by the DOE.
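To make the comparison between an absolute and a size-adjusted measure concrete, the following minimal Python sketch ranks a handful of areas by both. It is an illustration only: the area labels and counts are hypothetical, the DOE is taken here to be the observed minus the expected number of cases, and the "Chi-square distance" is assumed to be the signed Pearson-type quantity (O - E)^2 / E, which may differ from the definition the authors assessed. The toy numbers say nothing about how similar the two rankings are on real data.

```python
# Hypothetical observed (O) and expected (E) case counts for five areas.
areas = ["A", "B", "C", "D", "E"]
observed = [120, 45, 305, 12, 80]
expected = [100.0, 30.0, 280.0, 6.0, 85.0]


def doe(o, e):
    """Absolute measure: observed minus expected cases (assumed definition)."""
    return o - e


def signed_chi_square(o, e):
    """Size-adjusted measure: (O - E)^2 / E, keeping the sign of O - E
    so that deficits are not ranked as excesses (assumed definition)."""
    diff = o - e
    sign = 1 if diff >= 0 else -1
    return sign * diff * diff / e


def rank_by(scores):
    """Return area labels ordered from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [areas[i] for i in order]


doe_scores = [doe(o, e) for o, e in zip(observed, expected)]
chi_scores = [signed_chi_square(o, e) for o, e in zip(observed, expected)]

doe_ranking = rank_by(doe_scores)
chi_ranking = rank_by(chi_scores)

print("Ranking by DOE:       ", doe_ranking)
print("Ranking by Chi-square:", chi_ranking)
```

In this toy setting the large area C leads the DOE ranking because of its sheer size, whereas the small areas B and D move up once the excess is scaled by the expected count, which is the population dependence discussed above.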

The RANKSPA app, an open-source tool implemented in R and Shiny that incorporates functions from several R packages, is easy to use and allows health researchers to perform all the analyses of the pipeline without advanced statistical or programming skills. Because the pipeline implements initial-screening methods that are very common in surveillance, researchers who need to perform more complex analyses should use other statistical packages or more complete and complex applications such as SpatialEpiApp (4).

Regarding the application of the RANKSPA app to data, the utility of the results it provides may depend largely on spatial heterogeneity. When heterogeneity is very large, or when there is none, the areas of interest discriminated by the described pipeline can be expected to be harder to interpret, or not useful at all. Moreover, the ranges of heterogeneity in which the pipeline is useful will also depend on the spatial distribution of the disease. Therefore, the results obtained with the RANKSPA application must be carefully evaluated taking all these aspects into account. For reasons of time and space, this work presents no results for other possible situations. Further studies are needed to address other types of scenarios, in which changes to the pipeline could be identified and proposed to provide a useful response.

Regarding the proposed pipeline, some assumptions have been made about the neighbourhood between spatial units and about the priors in the spatial models. Although these choices are among the most widely used in studies of the spatial distribution of diseases such as cancer, they may not be the most suitable in certain contexts, for example for rare diseases in which the observed cases are very few or even zero in most areas. Therefore, the results obtained must also be carefully evaluated taking all these aspects into account.

The real lung cancer mortality data used here are a good example of a setting in which the results may be useful for monitoring the disease. There is some spatial heterogeneity, with a clear and localized spatial pattern. However, at the highest values of the risk measures, the ranking of the spatial units depends very much on which measure of risk is taken into account first: the absolute measure (as implemented directly in the RANKSPA application) or the relative measure (which the table shown in the RANKSPA app allows users to view quickly).
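The dependence of the ranking on which measure is considered first can be sketched in a few lines of Python. This is a simplified illustration, not the RANKSPA implementation: the areas and their DOE, WRR and PP values are hypothetical, and the 80% posterior probability criterion is assumed to act as a simple filter applied before sorting.

```python
# Hypothetical areas with an absolute measure (doe), a relative measure (wrr)
# and a posterior probability (pp). None of these values come from the study.
areas = [
    {"name": "A", "doe": 40.0, "wrr": 1.20, "pp": 0.95},
    {"name": "B", "doe": 12.0, "wrr": 1.85, "pp": 0.90},
    {"name": "C", "doe": 25.0, "wrr": 1.50, "pp": 0.85},
    {"name": "D", "doe": 30.0, "wrr": 1.10, "pp": 0.60},
]

PP_THRESHOLD = 0.80  # assumed use of the 80% criterion as a filter

# Keep only the areas whose posterior probability exceeds the threshold.
candidates = [a for a in areas if a["pp"] > PP_THRESHOLD]

# Absolute measure first, relative measure as tie-breaker.
absolute_first = sorted(
    candidates, key=lambda a: (a["doe"], a["wrr"]), reverse=True
)

# Relative measure first, absolute measure as tie-breaker.
relative_first = sorted(
    candidates, key=lambda a: (a["wrr"], a["doe"]), reverse=True
)

abs_names = [a["name"] for a in absolute_first]
rel_names = [a["name"] for a in relative_first]

print("Absolute measure first:", abs_names)
print("Relative measure first:", rel_names)
```

With these made-up values the two orderings are reversed at the top, which is the kind of sensitivity that the sortable table in the app makes easy to inspect.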

Finally, we have tried to design both the pipeline and the application to allow for a quick and easy-to-interpret exploratory analysis. We plan to incorporate additional methodological approaches (such as spatio-temporal analysis, cut-off combinations for posterior probabilities, other types of neighbourhoods, model settings, …) or other technical aspects related to the flexibility of data loading and interaction with the information provided, to enhance the work presented here.

As a general conclusion, this work presents an analysis strategy for an initial, exploratory screening of disease risk that discriminates regions of interest by studying the spatial distribution in small areas. It is focused primarily on chronic diseases such as cancer, for which the proposed methods are commonly used in monitoring. Moreover, an R-Shiny application (RANKSPA) has been created to implement this strategy simply and quickly, without requiring deep technical knowledge of the underlying statistical and programming methods. However, due to the limitations of the implemented methodology, the results obtained should be treated with caution and should in no case be used as the only basis for decision-making. See, for example, (34) for a recent review of sound methods for spatial epidemiology and the detection of regions of high risk. Nevertheless, for an initial exploratory objective and/or to quickly obtain information in the context of health management, RANKSPA may be more useful than other applications.

Code and data availability

Pipeline R-code: R_script.R (https://doi.org/10.6084/m9.figshare.11635908)
RANKSPA R-Shiny code: RANKSPA.zip (https://doi.org/10.6084/m9.figshare.11635935)
Simulated data: data_example.csv (https://doi.org/10.6084/m9.figshare.11467749)
Shapefiles: map_example.zip (https://doi.org/10.6084/m9.figshare.11467755)

Software availability

https://biodama.shinyapps.io/rankspa/

Acknowledgments

Funding: The study was partially supported by research grants from the Spanish Health Research Fund (FIS PI17CIII/00040), by grants PPIC-2014-001-P and SBPLY/17/180501/000491, funded by Consejería de Educación, Cultura y Deportes (Junta de Comunidades de Castilla-La Mancha, Spain) and FEDER, grant PID2019-106341GB-I00 from Ministerio de Ciencia e Innovación (Spain) and grant MTM2016-77501-P, funded by Ministerio de Economía y Competitividad (Spain). FPP has been supported by a Ph.D. scholarship awarded by the University of Castilla-La Mancha (Spain).

Footnote

Provenance and Peer Review: This article was commissioned by the Guest Editors (Peter Baade and Susanna Cramb) for the series “Spatial Patterns in Cancer Epidemiology” published in Annals of Cancer Epidemiology. The article has undergone external peer review.

Data Sharing Statement: Available at http://dx.doi.org/10.21037/ace-20-15

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at http://dx.doi.org/10.21037/ace-20-15). The series “Spatial Patterns in Cancer Epidemiology” was commissioned by the editorial office without any funding or sponsorship. The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013), and ethical approval was not required due to the type of study and design carried out.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Lawson AB, Kleinman K. Spatial and syndromic surveillance for public health. Chichester: John Wiley & Sons; 2005:288.
2. Anselin L, Syabri I, Kho Y. GeoDa: An Introduction to spatial data analysis. Geogr Anal 2006;38:5-22.
3. Adin A, Goicoa T, Ugarte MD. Online relative risks/rates estimation in spatial and spatio-temporal disease mapping. Comput Methods Programs Biomed 2019;172:103-16.
4. Moraga P. SpatialEpiApp: A Shiny web application for the analysis of spatial and spatio-temporal disease data. Spat Spatiotemporal Epidemiol 2017;23:47-57.
5. Richardson S, Thomson A, Best N, et al. Interpreting posterior relative risk estimates in disease-mapping studies. Environ Health Perspect 2004;112:1016-25.
6. Elliott P, Wartenberg D. Spatial epidemiology: current approaches and future challenges. Environ Health Perspect 2004;112:998-1006.
7. Lawson AB, Biggeri A, Böhning D, et al. editors. Disease mapping and risk assessment for public health. 1st ed. Chichester: Wiley; 1999:502.
8. López-Abente G, Aragonés N, Pérez-Gómez B, et al. Time trends in municipal distribution patterns of cancer mortality in Spain. BMC Cancer 2014;14:535.
9. Rodriguez-Sanchez L, Fernández-Navarro P, López-Abente G, et al. Different spatial pattern of municipal prostate cancer mortality in younger men in Spain. PLoS One 2019;14:e0210980.
10. Baltrus P, Malhotra K, Rust G, et al. Identifying county-level all-cause mortality rate trajectories and their spatial distribution across the United States. Prev Chronic Dis 2019;16:E55.
11. Jiang F, Chu J, Chen X, et al. Spatial distribution and clusters of pancreatic cancer mortality in Shandong Province, China. Sci Rep 2019;9:12917.
12. Duncan EW, Cramb SM, Aitken JF, et al. Development of the Australian Cancer Atlas: spatial modelling, visualisation, and reporting of estimates. Int J Health Geogr 2019;18:21.
13. López-Abente G, Ramis R, Pollán M, et al. Atlas municipal de mortalidad por cáncer en España 1989-1998. Madrid: Área de Epidemiología Ambiental y Cáncer del Centro Nacional de Epidemiología, ISCIII; 2007.
14. Pickle LW; National Center for Health Statistics (U.S.). Atlas of United States mortality. Hyattsville: National Center for Health Statistics, Centers for Disease Control and Prevention, U.S. Dept. of Health and Human Services; 1996. Available online: http://catalog.hathitrust.org/api/volumes/oclc/36045997.html
15. Boyle P, Smans M. Atlas of Cancer Mortality in the European Union and the European Economic Area, 1993-1997. Available online: https://publications.iarc.fr/Book-And-Report-Series/Iarc-Scientific-Publications/Atlas-Of-Cancer-Mortality-In-The-European-Union-And-The-European-Economic-Area-1993-1997-2008
16. Bivand RS, Wong DWS. Comparing implementations of global and local indicators of spatial association. TEST 2018;27:716-48.
17. Bivand RS, Pebesma E, Gómez-Rubio V. Applied spatial data analysis with R. New York: Springer; 2013:415.
18. Gómez-Rubio V, Ferrándiz-Ferragud J, López-Quílez A. Detecting clusters of disease with R. J Geogr Syst 2005;7:189-206.
19. Moran PA. Notes on continuous stochastic phenomena. Biometrika 1950;37:17-23.
20. Tango T. A class of tests for detecting 'general' and 'focused' clustering of rare diseases. Stat Med 1995;14:2323-34.
21. Tango T. A test for spatial disease clustering adjusted for multiple testing. Stat Med 2000;19:191-204.
22. Kulldorff M, Nagarwalla N. Spatial disease clusters: detection and inference. Stat Med 1995;14:799-810.
23. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math 1991;43:1-20.
24. López-Abente G, García-Gómez M, Menéndez-Navarro A, et al. Pleural cancer mortality in Spain: time-trends and updating of predictions up to 2020. BMC Cancer 2013;13:528.
25. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J R Stat Soc Ser B 2009;71:319-92.
26. Carroll R, Lawson AB, Faes C, et al. Comparing INLA and OpenBUGS for hierarchical Poisson modeling in disease mapping. Spat Spatiotemporal Epidemiol 2015;14-15:45-54.
27. Rue H, Riebler A, Sørbye SH, et al. Bayesian computing with INLA: a review. Annu Rev Stat Appl 2017;4:395-421.
28. De Smedt T, Simons K, Van Nieuwenhuyse A, et al. Comparing MCMC and INLA for disease mapping with Bayesian hierarchical models. Arch Public Health 2015;73:O2.
29. The R-Inla Package. Available online: http://www.r-inla.org/download
30. Gómez-Rubio V. Bayesian inference with INLA. Boca Raton: CRC Press; 2020.
31. Gómez-Rubio V, Moraga P, Molitor J, et al. DClusterm: Model-based detection of disease clusters. J Stat Softw 2019;90:1-26.
32. Bray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018;68:394-424.
33. De Angelis R, Sant M, Coleman MP, et al. Cancer survival in Europe 1999-2007 by country and age: results of EUROCARE--5-a population-based study. Lancet Oncol 2014;15:23-34.
34. Lawson AB, Banerjee S, Haining RP, et al. editors. Handbook of Spatial Epidemiology. 1st ed. New York: Chapman and Hall/CRC; 2016.

doi: 10.21037/ace-20-15
Cite this article as: Fernández-Navarro P, González-Palacios J, González-Sánchez M, Ramis R, Nuñez O, Palmí-Perales F, Gómez-Rubio V. Ranking spatial areas by risk of cancer: modelling in epidemiological surveillance. Ann Cancer Epidemiol 2020;4:10.
