Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names

Tags:

r

mgcv

spdep

I am trying to fit Polish local gov election results 2015 following the supperb blog by @GavinSimpson. https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I join xls with shp data on 6 digit identifier (there may be a leading 0 s). I kept it a text variable. EDIT, I simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.

library(tidyverse)
library(sf)
library(mgcv)

# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file

boroughs_shp <- st_read("../../_mapy/gminy.shp",options = "ENCODING=WINDOWS-1250",
                     stringsAsFactors = FALSE ) %>% 
  st_transform(crs = 4326)%>% 
  janitor::clean_names() %>% 
# st_simplify(preserveTopology = T, dTolerance = 0.01) %>% 
  mutate(teryt=str_sub(jpt_kod_je, 1, 6)) %>% 
  select(teryt, nazwa=jpt_nazwa, geometry)

# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
  readxl::read_excel("data/2015-gl-lis-gm.xls",
             trim_ws = T, col_names = T) %>% 
  janitor::clean_names() %>% 
  select(teryt, liczba_wyborcow, glosy_niewazne)

elections <-
  boroughs_shp %>% fortify() %>% 
  left_join(elections_xls, by = "teryt") %>% 
  arrange(teryt) %>%
  mutate(idx = seq.int(nrow(.)) %>% as.factor(), 
         teryt = as.factor(teryt)) 

# Neighbors

boroughs_nb <-spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx )
names(boroughs_nb) <- attr(boroughs_nb, "region.id")

# Model

ctrl <- gam.control(nthreads = 4) 
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)), 
          data = elections,
          offset = log(liczba_wyborcow), # number of votes
          method = 'REML', 
          control = ctrl,
          family = betar()) 

Here is the error message:

    Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) : 
  mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
  the condition has length > 1 and only the first element will be used

elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?

EDIT: The condition mentioned in error message is met:

> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE
like image 651
Jacek Kotowski Avatar asked Jun 11 '19 11:06

Jacek Kotowski


1 Answers

It seems that I solved the issue, maybe not quite realizing how it did being stat beginner.

First, not a single NA should be present in modeled data. There was one. After that the mcgv seemed to run, but it took long time (quarter of an hour) and inexplicably for me, only when I limited no of knots to k=50, with poor results (less or more and it did not return any result) and with warning to be cautious about results. Then I tried to remove offset=log(liczba_wyborcow) ie offset number of voters and made number of void votes per 1000 my predicted variable.

elections <-
 boroughs_shp %>%  
 left_join(elections_xls, by = "teryt") %>% na.omit() %>% 
 arrange(teryt) %>% 
 mutate(idx = row_number() %>% as.factor()) %>% 
 mutate(void_ratio=round(glosy_niewazne/liczba_wyborcow,3)*1000)

Now that it is a count, why not try change family = betar() in gam formula to poisson() - still not a good result, and then to negative binomial family = nb() Now my formula looks like

m1 <-
gam(
 void_ratio ~ s(
 idx,
 bs = 'mrf',
 k =500,
 xt = list(nb = boroughs_nb),
 fx = TRUE),
 data = elections_df,
 method = 'REML', 
 control = gam.control(nthreads = 4),
 family = nb()
)

It seems now to be blazingly fast and return valid results with no warnings or errors. On a laptop with 4 cores Intel Core I7 6820HQ @ 2.70GHZ 16GB Win10 it takes now 1-2 minutest to build a model.

In brief, what I changed was: remove a single NA, remove offset from formula and use negative binomial distribution.

Here is the result of what I wanted to achieve, from left to right, real rate of void votes, a rate smoothed by a model and residuals indicating discrepancies. The mcgv code let me do that.

expected result

like image 180
Jacek Kotowski Avatar answered Sep 29 '22 11:09

Jacek Kotowski