This is my sample dataset:
Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)
I would like to delete every row whose Name contains a non-English character. For this sample, only "apple firm" should stay.
I tried the tm package, but it only removes the non-English characters themselves rather than the whole entries.
I would check out this related Stack Overflow post on doing the same thing in JavaScript: "Regular expression to match non-English characters?"
To translate this into R, you could do (to match non-ASCII):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
The same SO post also gives a Unicode-escape form of the pattern, which matches the same non-ASCII characters:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
Note: we had to leave the NUL character out of the range for this to work, because R strings cannot contain an embedded NUL. So instead of starting at \u0000 or \x00, we start at \u0001 and \x01.
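As a variation on the same idea, here is a minimal sketch that wraps the filter in a small helper. The name is_ascii_only is hypothetical, not part of any package. It assumes you are happy to use perl = TRUE and doubled backslashes, so that the regex engine rather than the R parser interprets the \x01-\x7F range. The sample names are written with \u escapes only so the script does not depend on the file's encoding.

```r
# Sample data from the question; the \u escapes spell out the same three names.
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank)

# Hypothetical helper: TRUE for strings with no character outside \x01-\x7F.
# The doubled backslashes pass the escapes through to the regex engine.
is_ascii_only <- function(x) {
  !grepl("[^\\x01-\\x7F]", x, perl = TRUE)
}

# Logical indexing keeps only the rows whose Name is pure ASCII.
res <- data[is_ascii_only(data$Name), ]
res
```

The logical vector can index the data frame directly, so which() is not needed here. Note that grepl() returns FALSE for NA, so rows with a missing Name would be kept by this helper.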