Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R ggplot2 merge with shapefile and csv data to fill polygons - specific

I know that this question has been answered elsewhere already. I tried to follow the instructions of @jlhoward but obviously my skills are too limited. May I ask again of your help, R community?

Here's what I have:

A shapefile for Switzerland: Link

And the corresponding CSV-file for the names of municipalities as well as their zip-codes: Link

Website of data: cadastre.ch

Additional data on the last popular vote: Direct link, excel-file

I add a column to the CSV-file (wow.csv) (the data I want to illustrate) by merging. The file looks now like this:

Gemeinden   code    Ja.Anteil   Ortschaft       PLZ Zusatzziffer    Kantonskürzel   E   N
1   Aadorf  4551    78.78638    Aawangen        8522    2           TG            710206    263564
2   Aadorf  4551    78.78638    Ettenhausen TG  8356    0           TG            710129    259411
3   Aadorf  4551    78.78638    Aadorf          8355    0           TG            710588    261648
4   Aadorf  4551    78.78638    Guntershausen   8357    0           TG            711741    258934
5   Aadorf  4551    78.78638    Wittenwil       9547    0           TG            712002    262572

Afterwards I tried to follow the instructions of @jlhoward:

  1. Import temperature data file
  2. Import polygon shapefile of municipalities
  3. Convert muni polygons to a data frame for plotting
  4. Join columns from ch12@data to ch12.df
  5. Join columns from wow to ch12.df
  6. Make the plot

I tried it with the following code:

require("rgdal")
require("maptools")
require("ggplot2")
require("plyr")

# read share data and the file from cadastre.ch (zip-codes)
asyl <- <- read.csv("~/FS14-1/PLZO_SHP_LV03/Asylgesetz_csv.csv", sep=";")
mydata1 <- read.table("~/FS14-1/PLZO_SHP_LV03/PLZO_CSV_LV03.csv", sep=";", quote="\"")

#merge the two files
wow <- merge(x = asyl, y = mydata1, by = "Gemeinden", all = TRUE)

# read municipality polygons
ch12 <- readOGR(work.dir, layer = "PLZO_PLZ")

# fortify and merge: muni.df is used in ggplot
ch12@data$id <- rownames(ch12@data)
ch12.df <- fortify(ch12)
ch12.df <- join(ch12.df, ch12@data, by="id")
ch12.df <- merge(ch12.df, wow, by="PLZ", all.x=T, a..ly=F)

#create the map layers
gp <- ggplot(data=ch12.df, aes(x=long, y=lat, group=group)) + 
      geom_polygon(aes(group = group))+         # draw polygons
      geom_path(color="grey", linestyle=2)+  # draw boundaries
      coord_equal() +
      scale_fill_gradient(low = "#ffffcc", high = "#ff4444", 
                             space = "Lab", na.value = "grey50",
                             guide = "colourbar")+
      labs(title="Zustimmung auf Gemeindelevel")
gp

Well, until the last step, R worked (so far I can tell) but if I try to create the ggplot, I get no an error and R somehow terminates. What I'm trying to achieve is to control the colours (in my case of municipalities) of the polygons depending on data...

Can anyone help me?

like image 694
Thomas Avatar asked Feb 27 '14 10:02

Thomas


1 Answers

It is a bit unclear what you are trying to do, so I needed to make a few assumptions.

  1. Your Excel file has provisional voting results on a national initiative against mass immigration, by Gemeinden (municipality). So I assume you want a choropleth map showing, e.g., % YES. In what follows I cleaned up the Excel file a bit by extracting the 2353 rows containing the actual vote results (rows 13 - 2365), adding a header (from row 11), and saving as vote.csv. The column Ja in % was renamed Yes.Pct.
  2. Your shapefile is organized (more or less) by postal code (PLZ). This creates enormous problems, for a variety of reasons explained later. So I poked around a bit on Google and almost immediately found a shapefile organized by municipality at the Swiss Federal Office of Topography. You have to "order" the file, here, but it's free - so basically all you need to do is register and they email you a link to the files. The specific file-set I used was VEC200_Commune.*
  3. One advantage of the file from the Office of Topography is that it has municipality numbers (roughly equivalent to FIPS codes used by the US Federal Government). Your Excel file also has these numbers (called Gemeinde-Nr. in the Excel file, and BFSNR in the shapefile). Matching based on these id's is much more reliable than trying to match using names.

So putting all this together yields the following map: from this code:

library(plyr)    # for join(...)
library(rgdal)   # for readOGR(...)
library(ggplot2)

setwd("< directory with all files >")
votes <- read.csv("vote.csv")
map   <- readOGR(dsn=".",layer="VEC200_Commune")
map   <- map[map$COUNTRY=="CH",]   # extract just Swiss Gemeinde

data <- data.frame(id=rownames(map@data), 
                   GEMNAME=map@data$GEMNAME,
                   BFSNR=map@data$BFSNR)
# convert id to char from factor
data$id <- as.character(data$id)
# merge vote data based on Gemeinden (different columns names in the two df...)
data    <- merge(data,votes,by.x="BFSNR",by.y="Gemeinde.Nr.", all.x=T)

map.df <- fortify(map)
map.df <- join(map.df,data,by="id")
ggplot(map.df, aes(long,lat, group=group))+
  geom_polygon(aes(fill=Yes.Pct))+
  coord_fixed()+
  scale_fill_gradient(low = "#ffffcc", high = "#ff4444", 
                      space = "Lab", na.value = "grey80",
                      guide = "colourbar")+
  labs(title="Zustimmung auf Gemeindelevel", x="", y="")+
  theme(axis.text=element_blank(),axis.ticks=element_blank())

It is possible to use your shapefile (based on postal codes), but this adds complexity and may not be reliable. There are several reasons:

  1. Your shapefile has 4175 polygons but only 3201 unique postal codes. This means that many of the PLZ are duplicated (or worse). The same is true of your PLZO_CSV_LV03.csv: the PLZ are not unique. This is a problem when you merge on PLZ. Consider as an example that you merge two dataframes X and Y based on a common column PLZ. If X has, say 5 rows with a given PLZ and Y has 3 rows with that same PLZ, the result will have 15 rows with that PLZ.
  2. It turns out that merging on both PLZ and Zusatzziffer improved the situation, but does not completely eliminate duplication (that is, even when considering PLC and Zusatzziffer in combination, there is some duplication).
  3. None of the PLZ containing files had the Gemeinde-Nr., so the only option was to merge the voting data based on Gemeindename. This is risky, because often the names are not spelled exactly the same in different sources.
  4. The PLZ shapefile is very large, partly due to the number of polygons (4175) and partly due to the spatial resolution (e.g. more points per polygon). As a result, fortify(...) is extremely slow, and even rendering the map itself is slow. This may be why your R session crashed.

With all these caveats in mind, it is possible to produce the following map, using your PLZ-level shapefile: with this code:

votes    <- read.csv("vote.csv")
zipcodes <- read.csv(sep=";","PLZO_CSV_LV03.csv")
ch12 <- readOGR(dsn=".",layer="PLZO_PLZ")
# associate id, PLZ, and Zusatzziffer
data <- data.frame(id=rownames(ch12@data), 
                   PLZ=ch12@data$PLZ, 
                   Zusatzziffer=ch12@data$ZUSZIFF)
# convert id to char from factor
data$id <- as.character(data$id)
# need to merge based on PLZ *and* Zusatzziffer
data    <- merge(data,zipcodes[2:4],by=c("PLZ","Zusatzziffer"), all.x=T)
# merge vote data based on Gemeinden (different columns names in the two df...)
data    <- merge(data,votes,by.x="Gemeindename",by.y="Gemeinden", all.x=T)

ch12.df <- fortify(ch12) 
# join data to ch12.df based in id
ch12.df <- join(ch12.df, data, by="id")

gp <- ggplot(data=ch12.df, aes(x=long, y=lat, group=group)) + 
  geom_polygon(aes(fill = Yes.Pct))+    # draw polygons
  coord_equal() +
  scale_fill_gradient(low = "#ffffcc", high = "#ff4444", 
                      space = "Lab", na.value = "grey80",
                      guide = "colourbar")+
  labs(title="Zustimmung auf Gemeindelevel", x="", y="")+
  theme(axis.text=element_blank(),axis.ticks=element_blank())
gp

Note that the two maps are similar, but not quite exactly the same. I am inclined to trust the first one, because it is uses Gemeine Nr. instead of names, and because it involves fewer merges.

like image 161
jlhoward Avatar answered Nov 10 '22 23:11

jlhoward