This report is a sub-module of the bigger Shiny-based Visual Analytics Application (Shiny-VAA) project under ISSS608 Visual Analytics.
Given limited police resources and the adverse impact when crime occurs, crime analytics has been carried out as far back as the 1800s (Hunt, 2019). Crime occurrence was found to exhibit spatial patterns, suggesting that predictive analytics should be possible. Traditionally, crime analysis is done manually or through a spreadsheet program (RAND Corporation, 2013). This project gives users an easier way to perform such analysis through a web application.
Specifically, this report details the use of Geographically Weighted Regression (GWR) to explore spatial heterogeneity in the relationships between variables. Instead of using an ordinary least squares linear regression model, GWR is used to explore whether the relationship between the independent variables (population characteristics) and the outcome variable (crime rate) varies geographically.
A borough comprises wards, which are the primary unit of English electoral geography for civil parishes and district councils. There are a total of 32 boroughs in Greater London, excluding the City of London. GWR will be performed at the borough level, as most of the independent variables are available at the aggregated borough level.
The most common method of displaying the output of a geographically weighted regression on a SpatialPolygonsDataFrame is a choropleth map showing the residuals (Medina & Solymosi, 2019).
However, besides residuals, other values from the GWR output can also be used as the fill of a choropleth map and can be equally useful to users. These values are part of the GW.diagnostic output provided by the GWmodel package.
Depending on the user’s requirements, these values provide insight into geospatial differences. For instance, mapping the local R2 in a choropleth shows how well the model performs across polygons, i.e. what percentage of the outcome variable is explained by the input variables in each location.
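As a preview of how such a value can be visualised, the sketch below maps the local R2 stored in the SDF returned by gwr.basic() with tmap. This is only an illustration: the object names gwr_model and boroughs_sf are assumed, not objects used in this report.
# Sketch: pull the local R2 from the GWR output (SDF) and map it across the polygons.
# gwr_model is an assumed gwr.basic() result; boroughs_sf an assumed matching sf object.
boroughs_sf$Local_R2 <- as.data.frame(gwr_model$SDF)$Local_R2
tm_shape(boroughs_sf) +
  tm_fill("Local_R2", title = "Local R2", n = 5, style = "quantile", palette = "Blues") +
  tm_borders()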
The datasets are taken from the London Datastore (https://data.london.gov.uk/) and are as follows:
MPS Ward Level Crime Data
Land Area and Population
Income of Taxpayers
Economic Activity, Employment Rate and Unemployment Rate by Ethnic Group and Nationality
The R packages used for this assignment are:
1. For data cleaning and data wrangling: tidyverse (including dplyr), lubridate and readxl
2. For choropleth mapping: tmap, leaflet
3. For interactivity: plotly
4. For geospatial statistical modelling: sf (spatial data is required for GWR), GWmodel, spgwr, and spdep (to test for spatial autocorrelation)
The list of packages required for this report is loaded into RStudio, as seen in the code chunk below.
packages = c("tidyverse","lubridate", "readxl", "sf","tmap","leaflet", "plotly",
"spgwr","GWmodel","corrplot", "spdep")
for (p in packages){
if (!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
2.2.1 Loading Dataset into R
The MPS Ward Level crime data has been cleaned up by one of the group members, Hui Yong, and aggregated to the borough level. The crime type classification in 2019 differs from that of previous years; for instance, “Arson and Criminal Damage” and “Drug Offences” are only available in 2019. Hence, the crime types used in 2019 are renamed to follow the crime type classification used from 2010 to 2018.
borough_crime <- read_csv("data/Crime_2010_2019.csv")
glimpse(borough_crime)
Rows: 30,336
Columns: 6
$ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ~
$ Borough <chr> "Barking and Dagenham", "Barking and Dagenh~
$ GSS_Code <chr> "E09000002", "E09000002", "E09000002", "E09~
$ MajorText <chr> "Arson and Criminal Damage", "Burglary", "D~
$ Date <date> 2010-04-01, 2010-04-01, 2010-04-01, 2010-0~
$ count_by_borough <dbl> 225, 167, 85, 101, 58, 13, 568, 429, 204, 1~
2.2.2 Data Wrangling
The headers are renamed: MajorText refers to the type of crime, while count_by_borough refers to the number of crimes committed for each crime type. The data for 2010 is incomplete, as records only start from April of that year, and is hence filtered out.
Note: There is a new crime type, Public Order Offences, in 2019. It is assumed that some crimes were re-categorised into this crime type, and hence “Public Order Offences” is not eliminated in this project.
borough_crime <- borough_crime %>%
rename(CrimeType = MajorText, Crime_No = count_by_borough) %>%
select(Borough,CrimeType,Date,Crime_No) %>%
filter(Date > as.Date("2010-12-12"))
The data is then pivoted so that each column represents the number of crimes for each month. Any entries with NA values are replaced with 0.
borough_crime_month <- borough_crime %>%
group_by(Borough,CrimeType) %>%
spread(Date,Crime_No) %>%
ungroup()
borough_crime_month[is.na(borough_crime_month)] <- 0
The data is then grouped by year. The crime rate will be calculated as the total number of crimes divided by the population. As population data is only available annually, the total number of crimes is aggregated at the yearly level.
borough_crime_year <- borough_crime_month
borough_crime_year$Y2011 <- as.numeric(apply(borough_crime_year[,3:14], 1, sum))
borough_crime_year$Y2012 <- as.numeric(apply(borough_crime_year[,15:26], 1, sum))
borough_crime_year$Y2013 <- as.numeric(apply(borough_crime_year[,27:38], 1, sum))
borough_crime_year$Y2014 <- as.numeric(apply(borough_crime_year[,39:50], 1, sum))
borough_crime_year$Y2015 <- as.numeric(apply(borough_crime_year[,51:62], 1, sum))
borough_crime_year$Y2016 <- as.numeric(apply(borough_crime_year[,63:74], 1, sum))
borough_crime_year$Y2017 <- as.numeric(apply(borough_crime_year[,75:86], 1, sum))
borough_crime_year$Y2018 <- as.numeric(apply(borough_crime_year[,87:98], 1, sum))
borough_crime_year$Y2019 <- as.numeric(apply(borough_crime_year[,99:110], 1, sum))
borough_crime_year <- borough_crime_year %>%
select(Borough, CrimeType, Y2011:Y2019)
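The same yearly totals could also be computed more compactly by staying in long format and summing within each year. The sketch below is an assumed-equivalent alternative, not the chunk used above; borough_crime_year_alt is a hypothetical name.
# Extract the year from Date, sum the monthly counts within Borough, CrimeType and year,
# then pivot to one column per year to match the Y2011 to Y2019 layout above.
borough_crime_year_alt <- borough_crime %>%
  mutate(Year = paste0("Y", year(Date))) %>%
  group_by(Borough, CrimeType, Year) %>%
  summarise(Crime_No = sum(Crime_No), .groups = "drop") %>%
  spread(Year, Crime_No)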
A simple exploratory data analysis is done to understand the distribution of the number of crimes by crime type across the 32 boroughs.
The data is pivoted to include “Year” as a variable.
UK_CrimeType <- borough_crime_year%>%
rename(`2011`= Y2011,
`2012`= Y2012,
`2013`= Y2013,
`2014`= Y2014,
`2015`= Y2015,
`2016`= Y2016,
`2017`= Y2017,
`2018`= Y2018,
`2019`= Y2019,
) %>%
gather(Year, Crime_No,`2011`:`2019`)
head(UK_CrimeType)
# A tibble: 6 x 4
Borough CrimeType Year Crime_No
<chr> <chr> <chr> <dbl>
1 Barking and Dagenh~ Arson and Criminal Damage 2011 2184
2 Barking and Dagenh~ Burglary 2011 2356
3 Barking and Dagenh~ Drug Offences 2011 988
4 Barking and Dagenh~ Miscellaneous Crimes Against Soc~ 2011 973
5 Barking and Dagenh~ Public Order Offences 2011 0
6 Barking and Dagenh~ Robbery 2011 1029
The code chunk below utilises facet_wrap to generate a summary of the 9 crime types over the years 2011 to 2019. In the visualisation below, the crime type Theft and Handling shows some outliers in certain boroughs, and its volume is higher than that of the other crime types.
As the facets share a standardised axis, it is hard to visualise the distribution of crime for the other 8 offences, as they are all below the 20,000 mark.
bp_crime <- ggplot(data=UK_CrimeType,
aes(x=Year, y= Crime_No)) +
labs(title = "Distribution of Number of Crime By Crime Type Over Year",
x = "Year", y = "Number of Crime") +
geom_boxplot(alpha=0.5) +
facet_wrap(~CrimeType,ncol = 3) +
theme(axis.text.x = element_text(angle = 90, size=7),
strip.text.x = element_text(size=7))
bp_crime

The code chunk below generates another grid, excluding Theft and Handling.
UK_CrimeType_ExTheft <- UK_CrimeType%>%
filter(CrimeType != "Theft and Handling")
bp_crime_extheft <- ggplot(data=UK_CrimeType_ExTheft, aes(x=Year, y= Crime_No)) +
labs(title = "Distribution of Number of Crime By Crime Type Over Year",
x = "Year", y = "Number of Crime")+
geom_boxplot(alpha=0.5) +
facet_wrap(~CrimeType,ncol = 3) +
theme(axis.text.x = element_text(angle = 90, size=7),
strip.text.x = element_text(size=7))
bp_crime_extheft

The code chunk below generates another dot plot on Theft and Handling. For better interactivity, ggplotly is used so that users are able to hover over the data points to obtain more information on the outliers, in this case the borough.
UK_CrimeType_Theft <- UK_CrimeType%>%
filter(CrimeType == "Theft and Handling")
bp_crime_theft <- ggplot(data=UK_CrimeType_Theft,
aes(x=Year, y= Crime_No, text = Borough)) +
labs(title = "Distribution of Theft Crime Over Year",
x = "Year", y = "Number of Crime") +
geom_point(alpha=0.3, position = position_jitter(0.1))
ggplotly(bp_crime_theft,tooltip = c("text"))
As seen in the dot plot, Westminster has an exceptionally high number of theft and handling crimes compared to the other boroughs. In the next section, the crime rate will be calculated to provide a more accurate comparison across boroughs.
The code chunk below shows that in this assignment, the Theft and Handling crime type is extracted for further analysis.
crime_theft <- borough_crime_year %>%
filter(CrimeType == "Theft and Handling")
pop <- read_csv("data/housing-density-ward.csv")
head(pop,5)
# A tibble: 5 x 9
Code Borough Ward_Name Year Population Hectares Square_Kilometr~
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 E0500~ Barking~ Abbey 2011 12904 128. 1.28
2 E0500~ Barking~ Alibon 2011 10468 136. 1.36
3 E0500~ Barking~ Becontree 2011 11638 128. 1.28
4 E0500~ Barking~ Chadwell~ 2011 10098 338 3.38
5 E0500~ Barking~ Eastbrook 2011 10581 345. 3.45
# ... with 2 more variables: Population_per_hectare <dbl>,
# Population_per_square_kilometre <dbl>
The code chunk below aggregates the population data to the borough level, keeping only the years 2011 to 2019. The data is then pivoted by Year, and the population density of each borough is re-calculated in the next chunk.
pop_borough <- pop %>%
select("Borough", "Year","Population","Square_Kilometres") %>%
group_by(Borough,Year) %>%
summarise(across(where(is.numeric), list(sum = sum))) %>%
filter(Year < 2020) %>%
rename(Pop = Population_sum) %>%
rename(Square_Km = Square_Kilometres_sum)
pop_borough_year <- pop_borough %>%
spread(Year,Pop)
The population density is calculated as well, and the columns are renamed accordingly.
pop_borough_year$Pop_sqkm_2011 = pop_borough_year$`2011` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2012 = pop_borough_year$`2012` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2013 = pop_borough_year$`2013` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2014 = pop_borough_year$`2014` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2015 = pop_borough_year$`2015` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2016 = pop_borough_year$`2016` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2017 = pop_borough_year$`2017` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2018 = pop_borough_year$`2018` / pop_borough_year$Square_Km
pop_borough_year$Pop_sqkm_2019 = pop_borough_year$`2019` / pop_borough_year$Square_Km
pop_borough_year <- pop_borough_year %>%
rename(Pop_2011 = `2011`,
Pop_2012 = `2012`,
Pop_2013 = `2013`,
Pop_2014 = `2014`,
Pop_2015 = `2015`,
Pop_2016 = `2016`,
Pop_2017 = `2017`,
Pop_2018 = `2018`,
Pop_2019 = `2019`,
)
The code chunk below combines the theft crime data and the population data.
UK_data <- merge(crime_theft,pop_borough_year,by.x="Borough",by.y= "Borough")
The code chunk below shows the calculation of the crime rate: the number of crimes divided by the population size, expressed as a percentage. This allows a fair comparison across boroughs with different population sizes.
UK_data$Crime_2011 = UK_data$`Y2011` / UK_data$Pop_2011 *100
UK_data$Crime_2012 = UK_data$`Y2012` / UK_data$Pop_2012 *100
UK_data$Crime_2013 = UK_data$`Y2013` / UK_data$Pop_2013 *100
UK_data$Crime_2014 = UK_data$`Y2014` / UK_data$Pop_2014 *100
UK_data$Crime_2015 = UK_data$`Y2015` / UK_data$Pop_2015 *100
UK_data$Crime_2016 = UK_data$`Y2016` / UK_data$Pop_2016 *100
UK_data$Crime_2017 = UK_data$`Y2017` / UK_data$Pop_2017 *100
UK_data$Crime_2018 = UK_data$`Y2018` / UK_data$Pop_2018 *100
UK_data$Crime_2019 = UK_data$`Y2019` / UK_data$Pop_2019 *100
UK_data <- UK_data %>%
select(-Y2011:-Y2019)
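The nine crime-rate lines above could also be written as a short loop. The sketch below is an assumed-equivalent alternative and would need to run before the Y2011 to Y2019 columns are dropped.
# Yearly crime rate (%) = yearly crime count / yearly population * 100
for (yr in 2011:2019) {
  UK_data[[paste0("Crime_", yr)]] <-
    UK_data[[paste0("Y", yr)]] / UK_data[[paste0("Pop_", yr)]] * 100
}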
Population density was already calculated in Section 2.3, together with the total population per borough.
Loading Dataset into R
The dataset below provides population data, median income and mean income from 1999 to 2017. The borough-level data are extracted and the City of London is excluded. Only the 2011 to 2017 income data are used for this project.
income <- read_xlsx("data/income-of-tax-payers.xlsx", sheet = "Total Income") %>%
select (1:2,36:56) %>%
slice_head(n = 35) %>%
filter(Area != "City of London") %>%
drop_na()
Data Wrangling
The code chunk below shows the steps of data preparation, including the renaming and filtering of data for easier referencing.
names(income)[4] <- "Mean_Income_2011"
names(income)[5] <- "Medium_Income_2011"
names(income)[7] <- "Mean_Income_2012"
names(income)[8] <- "Medium_Income_2012"
names(income)[10] <- "Mean_Income_2013"
names(income)[11] <- "Medium_Income_2013"
names(income)[13] <- "Mean_Income_2014"
names(income)[14] <- "Medium_Income_2014"
names(income)[16] <- "Mean_Income_2015"
names(income)[17] <- "Medium_Income_2015"
names(income)[19] <- "Mean_Income_2016"
names(income)[20] <- "Medium_Income_2016"
names(income)[22] <- "Mean_Income_2017"
names(income)[23] <- "Medium_Income_2017"
income <- income %>%
select(Area,contains("Income"))
income$Area[income$Area == "Kingston-upon-Thames"] <- "Kingston upon Thames"
income$Area[income$Area == "Richmond-upon-Thames"] <- "Richmond upon Thames"
The dataset below provides economic activity, employment and unemployment rates by ethnic group and nationality. The borough-level data are extracted and the City of London is excluded. Only the 2011 to 2019 data are used for this project.
Data Wrangling
The code chunk below shows the data preparation for the unemployment rate and economic inactivity data. The same steps are repeated for each year from 2011 to 2019 using a loop.
Years = c("2011","2012","2013","2014","2015","2016","2017","2018","2019")
for (i in Years) {
EA <- read_xls("data/ea-rate-and-er-by-eg-and-nation.xls", sheet = i) %>%
select (1,2,19,23,27,31,35,39,43,47,20,24,28,32,36,40,44,48,21,25,29,33,37,41,45,49) %>%
slice_head(n = 36) %>%
filter(Area != "City of London") %>%
drop_na()
EA[ , 3:26] <- apply(EA[ , 3:26], 2, function(x) as.numeric(as.character(x)))
EA[is.na(EA)] <- 0
EA$Unem_rate<- as.numeric(apply(EA[,3:6], 1, sum))/
as.numeric(apply(EA[,11:14],1,sum)) * 100
EA$Einactive_rate <- as.numeric(apply(EA[,7:10], 1, sum)) /
as.numeric(apply(EA[,15:18],1,sum)) * 100
names(EA)[19] <- paste("Unem_WhiteUK_",i,sep="")
names(EA)[20] <- paste("Unem_WhitenotUK_",i,sep="")
names(EA)[21] <- paste("Unem_MinorUK_",i,sep="")
names(EA)[22] <- paste("Unem_MinornotUK_",i,sep="")
names(EA)[23] <- paste("EInActive_WhiteUK_",i,sep="")
names(EA)[24] <- paste("EInActive_WhitenotUK_",i,sep="")
names(EA)[25] <- paste("EInActive_MinorUK_",i,sep="")
names(EA)[26] <- paste("EInActive_MinornotUK_",i,sep="")
names(EA)[27] <- paste("Unem_rate_",i,sep="")
names(EA)[28] <- paste("Einactive_rate_", i,sep="")
EA<- EA %>%
select(2,19:28)
write.csv(EA, paste("data/EA/EA",i,".csv", sep = ""), row.names = FALSE)
}
The code chunk below combines all 9 CSV files into one data frame.
allfile <- list.files(path = "data/EA", pattern = "*.csv", full.names = TRUE)
EA <- read_csv("data/EA/EA2011.csv")
for (i in allfile) {
  if (i != "data/EA/EA2011.csv"){
    EA_temp <- read_csv(i)
    EA <- left_join(EA, EA_temp, by = c("Area" = "Area"))
  }
}
The shapefile loaded uses the Borough layer. The dataset is taken from the London Datastore: https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london. It is a MULTIPOLYGON geometry shapefile. The code chunk below loads the shapefile into LAD_data.
LAD_data <- st_read(dsn = "data/geospatial/London-wards-2018_ESRI",
layer = "London_Borough_Excluding_MHW")
Reading layer `London_Borough_Excluding_MHW' from data source `C:\tohms\blog2\_posts\2021-04-12-assignment\data\geospatial\London-wards-2018_ESRI' using driver `ESRI Shapefile'
Simple feature collection with 33 features and 7 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
projected CRS: OSGB 1936 / British National Grid
The initial EDA done in Section 2.3 is unable to show the location of the boroughs. In this section, choropleth maps are built for users to understand the location of the boroughs and the corresponding values of the independent and dependent variables.
The code chunk below creates “Year” as a variable.
UK_Crime_Rate <- UK_data%>%
select (Borough, Crime_2011:Crime_2019) %>%
rename(`2011`= Crime_2011,
`2012`= Crime_2012,
`2013`= Crime_2013,
`2014`= Crime_2014,
`2015`= Crime_2015,
`2016`= Crime_2016,
`2017`= Crime_2017,
`2018`= Crime_2018,
`2019`= Crime_2019,
) %>%
gather(Year, Crime_Rate,`2011`:`2019`)
The code chunk below combines the crime rate data with the geometry details required to create a choropleth map based on theft crime rate.
UK_Crime_Rate_Geo<- left_join(LAD_data,UK_Crime_Rate,by = c("NAME"="Borough")) %>%
drop_na(Crime_Rate)
The code chunk below shows a choropleth map, where crime rates are high in the centre of London. As the crime rate of most boroughs falls within a narrow range (below 5%), it is hard to further differentiate the crime rate between these boroughs.
crime_map1 = tm_shape(UK_Crime_Rate_Geo) +
tm_polygons("Crime_Rate",title = "Crime Rate (%)", n = 5,palette="Blues") +
tm_facets(by = "Year", ncol=3)
crime_map1

The code chunk below shows a choropleth map classified by quantile; the top 20% of boroughs in terms of theft crime rate are concentrated in the centre of London.
crime_map2 = tm_shape(UK_Crime_Rate_Geo) +
tm_polygons("Crime_Rate", title = "Crime Rate (%)", n = 5, style = "quantile",palette="Blues") +
tm_facets(by = "Year", ncol=3)
crime_map2

The code chunk below shows a histogram of the theft crime rate in 2017, where most boroughs have a crime rate between 2.5% and 3.5%.
theft_2017_ht <- ggplot(data = UK_data, aes(x= Crime_2017)) +
geom_histogram(bins=20) +
labs(title = "Distribution of Theft Crime across Borough in 2017",
x = "Crime Rate (%)", y = "Number of Borough") +
scale_y_continuous(breaks = c(0,2,4,6,8,10,12,14)) +
scale_x_continuous(limits=c(0,9), breaks = seq(1,9,1))
ggplotly(theft_2017_ht,tooltip=c("y"))
The code chunk below creates an interactive choropleth map highlighting the top 20% of boroughs with the highest theft crime rate in 2017. Interestingly, these top 20% are neighbouring boroughs.
UK_Crime_Rate_Geo_2017 <- UK_Crime_Rate_Geo %>%
filter(Year == 2017)
crime_map_2017 <- tm_shape(UK_Crime_Rate_Geo_2017) +
tm_polygons("Crime_Rate", title = "Crime Rate(%)",
n = 5, style = "quantile",palette = "Blues")
tmap_leaflet(crime_map_2017)
The code chunk below shows the transformation of the income dataset with “Year” as a variable.
mean_income_year <- income %>%
select(Area, contains("Mean_")) %>%
rename_with(.cols = 2 , ~"2011") %>%
rename_with(.cols = 3 , ~"2012") %>%
rename_with(.cols = 4 , ~"2013") %>%
rename_with(.cols = 5 , ~"2014") %>%
rename_with(.cols = 6 , ~"2015") %>%
rename_with(.cols = 7 , ~"2016") %>%
rename_with(.cols = 8 , ~"2017") %>%
gather(Year, Mean_Income,`2011`:`2017`)
Geometry details are added in.
mean_income_year_Geo<- left_join(LAD_data,mean_income_year,by = c("NAME"="Area")) %>%
drop_na(Mean_Income)
A choropleth is created to show the distribution of mean income across boroughs over the years.
mean_income_map = tm_shape(mean_income_year_Geo) +
tm_polygons("Mean_Income",title = "Mean Income", n = 8, style = "quantile",palette = "Blues") +
tm_facets(by = "Year", ncol=3)
mean_income_map

Interestingly, the boroughs with the top 20% mean income are concentrated together in central London.
The code chunk below shows the transformation of the EA dataset with “Year” as a variable, keeping only the total unemployment data for each year.
Unem_year <- EA %>%
select(Area,contains("Unem_rate_")) %>%
rename(`2011` = "Unem_rate_2011",
`2012` = "Unem_rate_2012",
`2013` = "Unem_rate_2013",
`2014` = "Unem_rate_2014",
`2015` = "Unem_rate_2015",
`2016` = "Unem_rate_2016",
`2017` = "Unem_rate_2017",
`2018` = "Unem_rate_2018",
`2019` = "Unem_rate_2019") %>%
gather(Year, Unem_rate,`2011`:`2019`)
Geometry data is added in and City of London is excluded.
Unem_year_Geo<- left_join(LAD_data, Unem_year,by = c("NAME"="Area")) %>%
drop_na(Unem_rate)
A choropleth is created to show the distribution of unemployment rate across boroughs over the years.
Unem_map = tm_shape(Unem_year_Geo) +
tm_polygons("Unem_rate", title ="Unemployment rate (%)", n = 8, style = "pretty",palette = "Blues") +
tm_facets(by = "Year", ncol=3)
Unem_map

Based on the choropleth maps, the unemployment rate has fallen over the years. It also appears to be randomly distributed across boroughs and might not be a variable correlated with the theft crime rate.
The code chunk below combines all the variable datasets, namely crime rate, population density, income and economic activity.
UK_DF <- left_join(UK_data, income, by = c("Borough"="Area"))
UK_DF <- left_join(UK_DF, EA, by = c("Borough"="Area"))
UK_DF_Geo <- left_join(LAD_data,UK_DF,by = c("NAME"="Borough")) %>%
drop_na(Square_Km)
Using Year 2017 variables, a simple linear regression formula is built as seen in the code chunk below.
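The formula follows the call shown in the model output below; the object name results_1 mirrors the results_2 and results_3 objects used for the later models.
# Simple linear regression of the 2017 theft crime rate on the 2017 mean income
model_1 <- (Crime_2017 ~ Mean_Income_2017)
results_1 <- lm(model_1, UK_DF)
summary(results_1)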
Call:
lm(formula = model_1, data = UK_DF)
Residuals:
Min 1Q Median 3Q Max
-3.5729 -0.9459 -0.1201 0.4848 5.2667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.028e+00 5.444e-01 1.888 0.0687 .
Mean_Income_2017 4.938e-05 9.092e-06 5.432 6.89e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.529 on 30 degrees of freedom
Multiple R-squared: 0.4958, Adjusted R-squared: 0.479
F-statistic: 29.5 on 1 and 30 DF, p-value: 6.889e-06
In this linear model, the adjusted R-squared is 0.479, i.e. about 48% of the variation in the 2017 crime rate across the 32 boroughs can be explained by the fitted equation:
Crime Rate of 2017 = 4.938e-05 x (Mean Income of 2017) + 1.028
The code chunk below plots the best fit line of the simple regression - model 1.
ggplot(data=UK_DF, aes(y=`Crime_2017`, x=`Mean_Income_2017`)) +
geom_point() +
labs(title = "Simple Regression of Mean Income and Crime rate (%) in 2017",
x="Mean Income", y = "Crime Rate(%)") +
geom_smooth(method = lm)

There are a couple of data points that are outliers; hence, the code chunk below labels these data points for easier reference and annotates the plot with the adjusted R-squared of the regression. This simple linear regression model has room for improvement and is not the best model for crime rate.
ggplot(data=UK_DF, aes(y=`Crime_2017`, x=`Mean_Income_2017`)) +
geom_point(color="steelblue") +
labs(title = "Simple Regression of Mean Income and Crime rate (%) in 2017",
x="Mean Income", y = "Crime Rate(%)") +
geom_smooth(method = lm, color = "steelblue") +
geom_text(aes(label = ifelse(`Mean_Income_2017`>75000,Borough,'')),
size = 3, colour ="indianred", hjust = 1) +
annotate("text", x = 75000, y = 10,
label = "Adjusted R-squared: \n 0.479", color = "skyblue4")

Before expanding to a multiple linear regression, it is important to check for collinearity between variables. The code chunk below selects all the 2017 variables, excluding the outcome Crime_2017.
UK_DF_Cor <- UK_DF %>%
select(contains("2017"),-"Crime_2017")
A matrix of the relationships between the variables is plotted. As there is a large number of variables, the method “circle” is chosen for easier visualisation.
corrplot(cor(UK_DF_Cor), diag = FALSE, order = "AOE", method = "circle",
type = "upper", tl.pos = "td", tl.cex = 0.5, tl.col ="black")

Due to collinearity, the following variables are removed: Medium (median) Income, the economic inactivity rate of Minority (Not UK), the economic inactivity rate of White (UK), and the unemployment rate of Minority (Not UK).
model_2 <- (Crime_2017 ~ Mean_Income_2017 + Pop_sqkm_2017 + Unem_rate_2017 +
Einactive_rate_2017+ EInActive_MinorUK_2017 + EInActive_WhitenotUK_2017
+ Unem_WhiteUK_2017 + Unem_WhitenotUK_2017 +
Unem_MinorUK_2017)
results_2 <- lm(model_2,UK_DF)
summary(results_2)
Call:
lm(formula = model_2, data = UK_DF)
Residuals:
Min 1Q Median 3Q Max
-2.7842 -0.5810 -0.1767 0.1947 4.0131
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.060e+00 1.288e+00 -1.600 0.1239
Mean_Income_2017 2.558e-05 1.067e-05 2.397 0.0255 *
Pop_sqkm_2017 2.299e-04 9.393e-05 2.448 0.0228 *
Unem_rate_2017 -3.636e-01 2.336e-01 -1.556 0.1339
Einactive_rate_2017 1.564e-01 9.857e-02 1.587 0.1269
EInActive_MinorUK_2017 -2.789e-02 3.870e-02 -0.721 0.4787
EInActive_WhitenotUK_2017 -8.213e-04 6.697e-02 -0.012 0.9903
Unem_WhiteUK_2017 2.295e-01 1.245e-01 1.843 0.0788 .
Unem_WhitenotUK_2017 8.305e-02 9.139e-02 0.909 0.3733
Unem_MinorUK_2017 7.435e-02 6.507e-02 1.143 0.2655
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.331 on 22 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.6049
F-statistic: 6.275 on 9 and 22 DF, p-value: 0.0002153
The code chunk below shows the final multiple regression model, which improves the adjusted R-squared from 0.479 to 0.631.
model_3 <- (Crime_2017 ~ Mean_Income_2017 + Pop_sqkm_2017 + Unem_WhiteUK_2017)
results_3 <- lm(model_3,UK_DF)
summary(results_3)
Call:
lm(formula = model_3, data = UK_DF)
Residuals:
Min 1Q Median 3Q Max
-3.0917 -0.5700 -0.1301 0.3153 4.6613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.321e-01 6.441e-01 -0.981 0.334787
Mean_Income_2017 3.340e-05 8.747e-06 3.818 0.000683 ***
Pop_sqkm_2017 2.351e-04 6.767e-05 3.474 0.001687 **
Unem_WhiteUK_2017 1.775e-01 8.578e-02 2.069 0.047888 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.287 on 28 degrees of freedom
Multiple R-squared: 0.6668, Adjusted R-squared: 0.6311
F-statistic: 18.68 on 3 and 28 DF, p-value: 7.486e-07
The code chunk below converts the UK dataset into a SpatialPolygonsDataFrame.
UK_sp <- as_Spatial(UK_DF_Geo)
As there are geospatial attributes, it is useful to explore and visualise the residuals of the multiple linear regression. The code chunk below converts the residuals of Model 3 into a data frame.
model_3_output <- as.data.frame(results_3$residuals)
The residuals are combined with the UK data frame that contains the geometry data.
model_3_output <- cbind(UK_DF_Geo,results_3$residuals) %>%
rename(`Residual` = `results_3.residuals`)
Lastly, the combined dataset is converted into a SpatialPolygonsDataFrame.
model_3_output_sp <- as_Spatial(model_3_output)
With this, the code chunk below builds an interactive choropleth map using tmap, with the residuals as the fill of the polygons.
model_3_res = tm_shape(model_3_output) +
tm_polygons("Residual", n = 4, style = "pretty", palette = "RdBu")
tmap_leaflet(model_3_res)
There appear to be signs of spatial autocorrelation, whereby boroughs with positive residuals are neighbouring boroughs (Medina & Solymosi, 2019): the over-predictions (negative residuals, in red) and under-predictions (positive residuals, in blue) show spatial patterning. Over-prediction tends to appear in the north-east boroughs, whereas under-prediction tends to appear in boroughs located in the south-west.
First, “neighbouring” polygons are defined. Neighbouring polygons can be contiguous polygons or polygons within a certain distance band (Gimond, 2019). Here, the poly2nb() function from spdep is used to identify contiguous neighbours.
nb <- poly2nb(UK_sp)
The code chunk below assigns weights to each neighbouring polygon. With style = “W”, each neighbouring polygon is assigned an equal weight of 1 divided by the number of neighbours (row standardisation); these weights are then used when summing the values of the neighbours.
lw <- nb2listw(nb, style = "W", zero.policy = TRUE )
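As an illustration of what lm.morantest() computes (a sketch, not part of the original analysis), the observed global Moran's I of the model residuals can be reproduced manually from the weights matrix.
# Moran's I of the OLS residuals: I = (n / S0) * sum_ij(w_ij * e_i * e_j) / sum_i(e_i^2),
# where S0 is the sum of all weights. This should match the Observed Moran I reported below.
e <- residuals(results_3)
W <- listw2mat(lw)   # spdep helper: convert the weights list to an n x n matrix
n <- length(e)
(n / sum(W)) * sum(W * outer(e, e)) / sum(e^2)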
Lastly, the code chunk below shows the Moran’s I test and the results.
lm.morantest(results_3,lw)
Global Moran I for regression residuals
data:
model: lm(formula = model_3, data = UK_DF)
weights: lw
Moran I statistic standard deviate = 2.5337, p-value =
0.005643
alternative hypothesis: greater
sample estimates:
Observed Moran I Expectation Variance
0.26135887 -0.04057091 0.01420012
Interestingly, as the p-value of 0.005643 is less than the alpha value of 0.05, we reject the null hypothesis that the residuals are randomly distributed. As the observed Moran's I is greater than 0, it can be inferred that the residuals resemble a clustered distribution.
Based on the results in Section 4, there is spatial autocorrelation, so the geospatial relationship between variables should be factored in. With Geographically Weighted Regression (GWR), several factors are involved in building a model: i) the bandwidth, ii) the kernel used, and iii) p, the power of the Minkowski distance (Kam, 2019).
The bandwidth is used in the weighting function; it can either be a fixed (calculated) distance or adaptive, based on the number of nearest neighbours.
For this assignment, the exponential kernel is chosen and two GWR models will be calibrated: one with a fixed bandwidth and one with an adaptive bandwidth.
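For reference, GWmodel's exponential kernel weights each observation by its distance from the regression point, with the weight decaying smoothly relative to the bandwidth (w = exp(-d/bw)). The short sketch below only illustrates this weighting; exp_kernel is a hypothetical helper, not a function used in this report.
# Exponential kernel: weight of an observation at distance d for bandwidth bw
exp_kernel <- function(d, bw) exp(-d / bw)
# e.g. with a 10 km bandwidth, a borough 5 km away gets weight ~0.61,
# while a borough 20 km away gets weight ~0.14
exp_kernel(c(5000, 20000), bw = 10000)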
The code chunk below converts the entire UK dataset into a SpatialDataFrame, in this case, a SpatialPolygonsDataFrame.
UK_DF_sp <- as_Spatial(UK_DF_Geo)
The code chunk below shows the GWR model formula, which uses the same variables as the multiple linear regression model (Model 3) in Section 4.2.2.
gwrmodel1 <- (Crime_2017 ~ Mean_Income_2017 + Pop_sqkm_2017 + Unem_WhiteUK_2017)
In the code chunk below, bw.gwr() of the GWmodel package is used to determine the optimal fixed bandwidth, with the adaptive argument set to FALSE. There are two possible approaches to determine the stopping rule: the cross-validation (CV) approach and the AIC corrected (AICc) approach. The bandwidth here is derived using the AICc approach.
bw_fixed <- bw.gwr (formula = gwrmodel1, data= UK_DF_sp,
approach = "AICc", kernel = "exponential",
adaptive = FALSE, longlat = FALSE)
Fixed bandwidth: 28625 AICc value: 115.6812
Fixed bandwidth: 17694.76 AICc value: 117.1956
Fixed bandwidth: 35380.26 AICc value: 115.3993
Fixed bandwidth: 39555.24 AICc value: 115.2984
Fixed bandwidth: 42135.51 AICc value: 115.2519
Fixed bandwidth: 43730.21 AICc value: 115.2276
Fixed bandwidth: 44715.79 AICc value: 115.2139
Fixed bandwidth: 45324.92 AICc value: 115.206
Fixed bandwidth: 45701.37 AICc value: 115.2013
Fixed bandwidth: 45934.04 AICc value: 115.1984
Fixed bandwidth: 46077.83 AICc value: 115.1967
Fixed bandwidth: 46166.7 AICc value: 115.1956
Fixed bandwidth: 46221.62 AICc value: 115.195
Fixed bandwidth: 46255.57 AICc value: 115.1946
Fixed bandwidth: 46276.55 AICc value: 115.1943
Fixed bandwidth: 46289.51 AICc value: 115.1942
Fixed bandwidth: 46297.53 AICc value: 115.1941
Based on the output, the recommended fixed bandwidth is 46297 metres.
The code chunk below records the results of the GWR model using a fixed bandwidth.
gwr_fixed <- gwr.basic(formula = gwrmodel1, data = UK_DF_sp,
bw = bw_fixed, kernel = "exponential",
longlat = FALSE, adaptive = FALSE)
gwr_fixed
***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2021-04-12 01:40:04
Call:
gwr.basic(formula = gwrmodel1, data = UK_DF_sp, bw = bw_fixed,
kernel = "exponential", adaptive = TRUE, longlat = FALSE)
Dependent (y) variable: Crime_2017
Independent variables: Mean_Income_2017 Pop_sqkm_2017 Unem_WhiteUK_2017
Number of data points: 32
***********************************************************************
* Results of Global Regression *
***********************************************************************
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.0917 -0.5700 -0.1301 0.3153 4.6613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.321e-01 6.441e-01 -0.981 0.334787
Mean_Income_2017 3.340e-05 8.747e-06 3.818 0.000683 ***
Pop_sqkm_2017 2.351e-04 6.767e-05 3.474 0.001687 **
Unem_WhiteUK_2017 1.775e-01 8.578e-02 2.069 0.047888 *
---Significance stars
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.287 on 28 degrees of freedom
Multiple R-squared: 0.6668
Adjusted R-squared: 0.6311
F-statistic: 18.68 on 3 and 28 DF, p-value: 7.486e-07
***Extra Diagnostic information
Residual sum of squares: 46.34713
Sigma(hat): 1.242942
AIC: 112.6656
AICc: 114.9733
BIC: 105.323
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: exponential
Adaptive bandwidth: 46297.53 (number of nearest neighbours)
Regression points: the same locations as observations are used.
Distance metric: Euclidean distance metric is used.
****************Summary of GWR coefficient estimates:******************
Min. 1st Qu. Median 3rd Qu.
Intercept -6.3233e-01 -6.3225e-01 -6.3220e-01 -6.3214e-01
Mean_Income_2017 3.3395e-05 3.3396e-05 3.3397e-05 3.3398e-05
Pop_sqkm_2017 2.3507e-04 2.3507e-04 2.3508e-04 2.3508e-04
Unem_WhiteUK_2017 1.7747e-01 1.7749e-01 1.7751e-01 1.7751e-01
Max.
Intercept -0.6321
Mean_Income_2017 0.0000
Pop_sqkm_2017 0.0002
Unem_WhiteUK_2017 0.1775
************************Diagnostic information*************************
Number of data points: 32
Effective number of parameters (2trace(S) - trace(S'S)): 4.001909
Effective degrees of freedom (n-2trace(S) + trace(S'S)): 27.99809
AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 114.9728
AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 106.6632
BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 84.52854
Residual sum of squares: 46.34231
R-square value: 0.6668111
Adjusted R-square value: 0.6174227
***********************************************************************
Program stops at: 2021-04-12 01:40:04
Based on the output, the adjusted R-squared is 0.617. This GWR model performs slightly worse than the multiple regression model, which delivers an adjusted R-squared of 0.631.
Alternatively, the GWR model can also be built using an adaptive bandwidth. As with the fixed bandwidth calculation, the kernel and approach remain the same, while the adaptive argument is set to TRUE, as seen in the code chunk below:
bw_adapt <- bw.gwr (formula = gwrmodel1, data= UK_DF_sp,
approach = "AICc", kernel = "exponential",
adaptive = TRUE, longlat = FALSE)
Adaptive bandwidth (number of nearest neighbours): 27 AICc value: 113.8343
Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 113.9467
Adaptive bandwidth (number of nearest neighbours): 30 AICc value: 114.1756
Adaptive bandwidth (number of nearest neighbours): 27 AICc value: 113.8343
Based on the output, an adaptive bandwidth of 27 nearest neighbours is recommended.
The code chunk below records the results of the GWR model using an adaptive bandwidth.
gwr_adapt <- gwr.basic(formula = gwrmodel1, data = UK_DF_sp,
bw = bw_adapt, kernel = "exponential",
longlat= FALSE, adaptive = TRUE)
gwr_adapt
***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2021-04-12 01:40:05
Call:
gwr.basic(formula = gwrmodel1, data = UK_DF_sp, bw = bw_adapt,
kernel = "exponential", adaptive = TRUE, longlat = FALSE)
Dependent (y) variable: Crime_2017
Independent variables: Mean_Income_2017 Pop_sqkm_2017 Unem_WhiteUK_2017
Number of data points: 32
***********************************************************************
* Results of Global Regression *
***********************************************************************
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.0917 -0.5700 -0.1301 0.3153 4.6613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.321e-01 6.441e-01 -0.981 0.334787
Mean_Income_2017 3.340e-05 8.747e-06 3.818 0.000683 ***
Pop_sqkm_2017 2.351e-04 6.767e-05 3.474 0.001687 **
Unem_WhiteUK_2017 1.775e-01 8.578e-02 2.069 0.047888 *
---Significance stars
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.287 on 28 degrees of freedom
Multiple R-squared: 0.6668
Adjusted R-squared: 0.6311
F-statistic: 18.68 on 3 and 28 DF, p-value: 7.486e-07
***Extra Diagnostic information
Residual sum of squares: 46.34713
Sigma(hat): 1.242942
AIC: 112.6656
AICc: 114.9733
BIC: 105.323
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: exponential
Adaptive bandwidth: 27 (number of nearest neighbours)
Regression points: the same locations as observations are used.
Distance metric: Euclidean distance metric is used.
****************Summary of GWR coefficient estimates:******************
Min. 1st Qu. Median 3rd Qu.
Intercept -1.1963e+00 -9.4395e-01 -8.0321e-01 -6.7807e-01
Mean_Income_2017 2.6631e-05 3.0583e-05 3.2706e-05 3.4882e-05
Pop_sqkm_2017 2.2225e-04 2.3576e-04 2.4746e-04 2.5477e-04
Unem_WhiteUK_2017 1.4064e-01 1.8085e-01 2.0885e-01 2.2942e-01
Max.
Intercept -0.5077
Mean_Income_2017 0.0000
Pop_sqkm_2017 0.0003
Unem_WhiteUK_2017 0.2896
************************Diagnostic information*************************
Number of data points: 32
Effective number of parameters (2trace(S) - trace(S'S)): 7.932024
Effective degrees of freedom (n-2trace(S) + trace(S'S)): 24.06798
AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 113.8343
AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 100.7809
BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 83.96183
Residual sum of squares: 36.04842
R-square value: 0.7408214
Adjusted R-square value: 0.6517018
***********************************************************************
Program stops at: 2021-04-12 01:40:05
Based on the output, the adjusted R-squared is 0.652. This GWR model performs better than the multiple regression model, which delivers an adjusted R-squared of 0.631.
The results of GWR include GW.arguments, GW.diagnostic, lm and a SpatialPolygonsDataFrame (SDF). The code chunk below converts the SDF into a data frame so that its fields can be visualised.
gwr_adapt_output <- as.data.frame(gwr_adapt$SDF)
The data frame is combined with the main UK data frame to reflect the corresponding borough and geometry details.
crime_rate_adapt <- UK_DF_Geo %>%
select(NAME, GSS_CODE) %>%
cbind(.,gwr_adapt_output)
The code chunk below builds a choropleth map with the local R-squared values as the fill of the polygons. tmap_leaflet() is used for interactivity.
gwr1Plot <- tm_shape(crime_rate_adapt) +
tm_fill("Local_R2", title = "R2 Values",
n = 5, style = "quantile", palette = "Blues") +
tm_borders(lwd = 1.2)
tmap_leaflet(gwr1Plot)
The code chunk below builds a choropleth map with the residuals as the fill of the polygons, again rendered interactively with tmap_leaflet().
gwr1residual <- tm_shape(crime_rate_adapt) +
tm_fill("residual", title = "Residual", n = 5,
style = "quantile", midpoint=NA, palette = "RdBu") +
tm_borders(lwd = 1.2)
tmap_leaflet(gwr1residual)
Comparing the various regression models, GWR models are able to deliver a better adjusted R-squared when there are signs of spatial autocorrelation. In this report, the model uses three independent variables - mean income, population density and the unemployment rate of the White (UK) group - to explore the relationships between these variables and the outcome variable, the crime rate.
The final visualisation in the R Shiny application will consist of three items: a choropleth map of the local R2 values across the boroughs, a choropleth map of the residual values across all 32 boroughs, and an annotation of the adjusted R2 value of the GWR model compared with the multiple linear regression model.

Gimond, M. (2019) A Basic Introduction to Moran’s I analysis in R. Retrieved from https://mgimond.github.io/simple_moransI_example/
Hunt, J. (2019, July 10). From Crime Mapping to Crime Forecasting: The Evolution of Place-Based Policing. National Institute of Justice. Retrieved from https://nij.ojp.gov/topics/articles/crime-mapping-crime-forecasting-evolution-place-based-policing#a-brief-history
Kam, T. S. (2019, March 12). Hands-on Exercise 9: Building Hedonic Pricing Model with GWR Method. Retrieved from https://rpubs.com/tskam/IS415-Hands-on_Ex09
Medina, J. & Solymosi, R. (2019, April 02). Crime Mapping in R. Retrieved from https://maczokni.github.io/crimemapping_textbook_bookdown/index.html
RAND Corporation. (2013). Predictive Policing: Forecasting Crime for Law Enforcement. Retrieved from https://www.rand.org/content/dam/rand/pubs/research_briefs/RB9700/RB9735/RAND_RB9735.pdf