I'm doing data cleaning. I use mutate in dplyr a lot, since it generates new columns step by step and I can easily see how it goes.
Here are two examples where I get this error:
Error: incompatible size (%d), expecting %d (the group size) or 1
Example 1: Get the town name from a zip code. The data simply looks like this:
Zip
1 02345
2 02201
I notice that it doesn't work when the data has an NA in it.
Without NA it works:
library(dplyr)
library(zipcode)
data(zipcode)
test = data.frame(Zip=c('02345','02201'),stringsAsFactors=FALSE)
test %>%
rowwise() %>%
mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )
resulting in
Source: local data frame [2 x 2]
Groups: <by row>
Zip Town1
1 02345 Manomet
2 02201 Boston
With NA it doesn't work:
library(dplyr)
library(zipcode)
data(zipcode)
test = data.frame(Zip=c('02345','02201',NA),stringsAsFactors=FALSE)
test %>%
rowwise() %>%
mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )
resulting in
Error: incompatible size (%d), expecting %d (the group size) or 1
Example 2: I want to get rid of the redundant state name that occurs in the Town column in the following data.
Town State
1 BOSTON MA MA
2 NORTH AMAMS MA
3 CHICAGO IL IL
This is how I do it: (1) split the string in Town into words, e.g. 'BOSTON' and 'MA' for row 1; (2) see if any of these words matches the State of that row; (3) delete the matched word.
library(dplyr)
test = data.frame(Town=c('BOSTON MA','NORTH AMAMS','CHICAGO IL'), State=c('MA','MA','IL'), stringsAsFactors=FALSE)
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(Town1 = Town.word[-is.state])
This results in:
Town State Town.word is.state Town1
1 BOSTON MA MA <chr[2]> 2 BOSTON
2 NORTH AMAMS MA <chr[2]> NA NA
3 CHICAGO IL IL <chr[2]> 2 CHICAGO
Meaning: e.g., row 1 shows is.state==2, so the 2nd word in Town is the state name. After getting rid of that word, Town1 is the correct town name.
Now I want to fix the NA in row 2, but adding na.omit causes an error:
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(Town1 = Town.word[-na.omit(is.state)])
results in:
Error: incompatible size (%d), expecting %d (the group size) or 1
I checked the data type and size:
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(length(is.state) ) %>%
mutate(class(na.omit(is.state)))
results in:
Town State Town.word is.state length(is.state) class(na.omit(is.state))
1 BOSTON MA MA <chr[2]> 2 1 integer
2 NORTH AMAMS MA <chr[2]> NA 1 integer
3 CHICAGO IL IL <chr[2]> 2 1 integer
So is.state is an integer of length 1. Can somebody tell me what's wrong? Thanks.
Can you just sub it out?
test %>%
rowwise() %>%
mutate(Town=sub(sprintf('[, ]*%s$', State), '', Town))
## Source: local data frame [3 x 2]
## Groups: <by row>
##
## Town State
## 1 BOSTON MA
## 2 NORTH AMAMS MA
## 3 CHICAGO IL
(This way also catches commas after the town, if that happens.)
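The rowwise() matters here because sub() only uses the first element of its pattern argument, so the pattern has to be built one row at a time. If you'd rather skip dplyr for this step, here is a minimal base-R sketch of the same idea, assuming the same test data frame as in the question. The helper name strip_state is made up for illustration.

```r
test <- data.frame(Town  = c('BOSTON MA', 'NORTH AMAMS', 'CHICAGO IL'),
                   State = c('MA', 'MA', 'IL'),
                   stringsAsFactors = FALSE)

# Hypothetical helper: drop a trailing state code, plus any commas or
# spaces directly in front of it, from a single town string.
strip_state <- function(town, state) {
  sub(sprintf('[, ]*%s$', state), '', town)
}

# mapply() pairs each Town with its own State, so every call to sub()
# sees exactly one pattern. unname() drops the names mapply() adds.
test$Town1 <- unname(mapply(strip_state, test$Town, test$State))
test$Town1
```

One thing to keep in mind: `[, ]*` also matches zero separators, so a town that merely ends in the state's letters would get trimmed too. Using `[, ]+` instead would require at least one comma or space before the state code.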
NB: if you use ungroup() here with a rowwise_df (as this is), it will wipe the tbl_df class as well and output a straight data.frame, which is fine for your data but will clobber your screen if you aren't careful and are looking at large amounts of data (as I've done countless times). (Github references #936 and #553.)
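As for the error in the question itself: my reading (not checked against the dplyr source) is that each rowwise mutate() result has to have length 1, and na.omit() on a lone NA leaves nothing behind, so the expression comes back with length 0. The base-R pieces can be checked on their own. In the sketch below, lookup is a hypothetical two-row stand-in for the zipcode table, and the variable names are only for illustration.

```r
# Example 2: what happens for the row where the state is not in Town
words    <- c('NORTH', 'AMAMS')
is_state <- match('MA', words)       # NA, no word equals the state

length(na.omit(is_state))            # the NA is dropped, nothing is left
length(words[-na.omit(is_state)])    # an empty index selects nothing

# A guard that always yields exactly one string per row
town1 <- if (is.na(is_state)) {
  paste(words, collapse = ' ')
} else {
  paste(words[-is_state], collapse = ' ')
}

# Example 1: hypothetical stand-in for the zipcode table
lookup <- data.frame(zip  = c('02345', '02201'),
                     city = c('Manomet', 'Boston'),
                     stringsAsFactors = FALSE)
zips <- c('02345', '02201', NA)

# Same zero-length result when the zip is NA
na_case <- lookup[lookup$zip == na.omit(zips[3]), 'city']
length(na_case)

# match() is vectorised and returns NA for a zip it cannot find,
# so no rowwise() or na.omit() is needed
towns <- lookup$city[match(zips, lookup$zip)]
towns
```

With the match() version, a missing zip simply gives a missing town, which keeps the new column the same length as the data.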