keep only alphanumeric characters and space in a string using gsub

I have a string that contains alphanumeric characters, special characters and non-UTF-8 characters. I want to strip the special and non-UTF-8 characters.

Here's what I've tried:

gsub('[^0-9a-z\\s]','',"ï¿½+ Sample string here =ï¿½{ï¿½>Eï¿½BHï¿½P<]ï¿½{ï¿½>")

This removes the special characters (punctuation and non-UTF-8 characters), but the output has no spaces.

gsub('/[^0-9a-z\\s]/i','',"ï¿½+ Sample string here =ï¿½{ï¿½>Eï¿½BHï¿½P<]ï¿½{ï¿½>")

The result has spaces, but the non-UTF-8 characters are still present.

Is there a workaround?

For the sample string above, output should be: Sample string here

720

asked Apr 08 '17 13:04

lilipunk

1 Answer

You could use the classes [:alnum:] and [:space:] for this:

sample_string <- "ï¿½+ Sample 2 string here =ï¿½{ï¿½>Eï¿½BHï¿½P<]ï¿½{ï¿½>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
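
Which characters count as [:alnum:] depends on the locale, so the result above can differ between machines. As a rough sketch (chars and classes are illustrative names, not part of the original answer), one way to inspect this character by character:

```r
# Characters to inspect; the non-ASCII ones are written as \u escapes
# so the script does not depend on the file encoding.
chars <- c("a", "Z", "7", "\u00ef", "\u00bf", "\u00bd", "+", " ")

# For each character, ask the regex engine whether it belongs to the
# POSIX classes. The answers for non-ASCII characters depend on the locale.
classes <- data.frame(
  char  = chars,
  alnum = grepl("[[:alnum:]]", chars),
  space = grepl("[[:space:]]", chars)
)
print(classes)
```

Any row where alnum is TRUE will survive the gsub() call above.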

Alternatively you can use PCRE codes to refer to specific character sets:

gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Both results show that the characters still present are considered letters. The E, B, H and P in the trailing part are letters as well, so the condition you are replacing on is not correct. You don't want to keep all letters; you only want to keep A-Z, a-z and 0-9:

gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
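
The result above still starts with a space. If a tidy ASCII-only string is wanted, a small helper could strip, collapse and trim in one go. This is only a sketch: keep_ascii_alnum and junk are hypothetical names introduced here, and the sample string is rebuilt with \u escapes on the assumption that each ï¿½ in the question is the three characters U+00EF, U+00BF and U+00BD.

```r
# Hypothetical helper: keep ASCII letters, digits and spaces, then tidy up.
keep_ascii_alnum <- function(x) {
  x <- gsub("[^A-Za-z0-9 ]", "", x)  # drop everything except A-Z, a-z, 0-9, space
  x <- gsub(" +", " ", x)            # collapse runs of spaces into one
  trimws(x)                          # remove leading and trailing whitespace
}

# Rebuild the sample string from the question without relying on file encoding.
junk <- "\u00ef\u00bf\u00bd"
sample_string <- paste0(junk, "+ Sample 2 string here =", junk, "{", junk, ">E",
                        junk, "BH", junk, "P<]", junk, "{", junk, ">")

cleaned <- keep_ascii_alnum(sample_string)
cleaned
#> [1] "Sample 2 string here EBHP"
```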

This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:

gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "

Or, if you want to find a string, even not bound by spaces, use the word boundary \\b instead:

gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
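
An alternative to replacing everything around the wanted part is to extract it directly with regexpr() and regmatches() from base R. The sketch below is not from the original answer; pattern, junk and first_run are illustrative names, and the pattern assumes the wanted text is the first run of ASCII words separated by single spaces:

```r
# Rebuild the sample string with \u escapes (assumes each "ï¿½" in the
# question is the three characters U+00EF, U+00BF, U+00BD).
junk <- "\u00ef\u00bf\u00bd"
sample_string <- paste0(junk, "+ Sample 2 string here =", junk, "{", junk, ">E",
                        junk, "BH", junk, "P<]", junk, "{", junk, ">")

# One ASCII word, optionally followed by more words separated by one space.
pattern <- "[A-Za-z0-9]+( [A-Za-z0-9]+)*"

# regexpr() locates the first match; regmatches() pulls it out of the string.
first_run <- regmatches(sample_string, regexpr(pattern, sample_string))
first_run
#> [1] "Sample 2 string here"
```

One difference from the gsub() approach: if nothing matches, regmatches() returns character(0), whereas gsub() returns the input unchanged.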

What happens here:

.*? matches any character (.) zero or more times (*), but lazily (?). This means that gsub will make this part of the match as short as possible.
Everything between ( and ) is captured and can be referred to in the replacement as \\1.
\\b indicates a word boundary
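The boundary is followed by one or more (+) characters that are A-Z, a-z, 0-9 or a space. The two letter ranges have to be written separately: a single range A-z also covers whatever sits between the uppercase and lowercase letters in the code table (in ASCII these are [ \ ] ^ _ and `), and R's regex documentation warns that ranges are interpreted in a locale-dependent way, so A-z can match more than the plain letters. (The special letters left over in the earlier outputs are valid UTF-8, by the way!)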
After that sequence, .* matches anything zero or more times, which removes the rest of the string.
The backreference \\1, in combination with the .* parts around the group, makes sure only the required part remains in the output.
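
The individual pieces described above can be tried in isolation. The toy inputs below are made up for illustration, and perl = TRUE is used throughout as an assumption so that the behaviour is that of PCRE rather than R's default engine:

```r
# Lazy vs. greedy: the lazy form stops at the first "b", the greedy one at the last.
lazy   <- sub("a.*?b", "", "aXbYb", perl = TRUE)
greedy <- sub("a.*b",  "", "aXbYb", perl = TRUE)
lazy    # "Yb"
greedy  # ""

# Capture groups and backreferences: swap two words.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world", perl = TRUE)
swapped  # "world hello"

# The single range A-z also matches ASCII punctuation between "Z" and "a";
# the two separate ranges do not.
wide   <- grepl("[A-z]",    c("_", "^"), perl = TRUE)
narrow <- grepl("[A-Za-z]", c("_", "^"), perl = TRUE)
wide    # TRUE TRUE
narrow  # FALSE FALSE
```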

165

answered Oct 24 '22 14:10

Joris Meys


Tags: string, regex, r, utf-8, gsub