Keep only alphanumeric characters and spaces in a string using gsub

I have a string that contains alphanumeric characters, special characters and non-UTF-8 characters. I want to strip the special and non-UTF-8 characters.

Here's what I've tried:

gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

However, this removes the special characters (punctuation and non-UTF-8), but the output has no spaces.

gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

The result has spaces, but non-UTF-8 characters are still present.

Is there a workaround?

For the sample string above, the output should be: Sample string here

lilipunk asked Apr 08 '17 13:04


1 Answer

You could use the classes [:alnum:] and [:space:] for this:

sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Alternatively, you can use PCRE property codes such as \\p{L} (any letter) to refer to specific character sets:

gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Both cases show clearly that the characters that are still there are considered letters. The E, B, H and P inside the garbled part are letters too, so the condition you're replacing on is not the right one. You don't want to keep all letters; you only want to keep A-Z, a-z, 0-9 and the space:

gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"

This still contains the EBHP. If you really just want to keep one section that contains only letters, numbers and spaces, you should use the reverse logic: select what you want and replace everything else with it, using a backreference:

gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
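The result above still carries a space on either side. A minimal sketch of stripping them, assuming base R's trimws(); the variable names res and trimmed are only illustrative and not part of the original answer:

```r
# Illustrative value: the output shown above, with a leading and a trailing space.
res <- " Sample 2 string here "

# trimws() removes whitespace from both ends by default (which = "both").
trimmed <- trimws(res)
print(trimmed)
```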

Or, if the string you want is not necessarily bounded by whitespace, use the word boundary \\b instead:

gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"

What happens here:

  • .*? matches any character (.) zero or more times (*), but lazily (?). This means the regex engine will consume as few characters as possible with this piece.
  • everything between () is captured and can be referred to in the replacement as \\1
  • \\b indicates a word boundary
  • The boundary is followed by one or more (+) characters that are A-Z, a-z, 0-9 or a space. The two letter ranges have to be written separately: in the ASCII table the characters [ \ ] ^ _ ` sit between Z and a, so A-z would match those as well, and because the interpretation of ranges is locale-dependent in R it may also match special letters (which are valid UTF-8, by the way!)
  • after that sequence, .* matches anything zero or more times, which removes the rest of the string.
  • the backreference \\1, in combination with the .* parts of the regex, makes sure only the required part remains in the output.
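
The pieces listed above can be tried out in isolation. The following is a minimal sketch rather than part of the original answer: it assumes an ASCII-only stand-in for the sample string (with # in place of the undecodable bytes) and uses perl = TRUE so that ranges are compared by code point. The names x, kept, first_run, wide and narrow are hypothetical.

```r
# ASCII stand-in for the sample string; '#' replaces the undecodable bytes (assumption).
x <- "#+ Sample 2 string here =#{#>E#BH#P<]#{#>"

# Lazy prefix, captured run between word boundaries, then the rest of the string.
kept <- sub(".*?(\\b[A-Za-z0-9 ]+\\b).*", "\\1", x, perl = TRUE)
print(kept)

# Same idea without a backreference: extract the first matching run directly.
first_run <- regmatches(x, regexpr("\\b[A-Za-z0-9 ]+\\b", x, perl = TRUE))
print(first_run)

# Why A-z is too wide: by code point it also covers [ \ ] ^ _ and the backtick.
wide <- grepl("[A-z]", c("[", "_", "^"), perl = TRUE)
narrow <- grepl("[A-Za-z]", c("[", "_", "^"), perl = TRUE)
print(wide)
print(narrow)
```

Extracting with regexpr() and regmatches() can be easier to read than replacing everything around the match, but it returns a zero-length vector when nothing matches, whereas sub() returns the input unchanged.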
Joris Meys answered Oct 24 '22 14:10
