I have a string which has alphanumeric characters, special characters and non UTF-8 characters. I want to strip the special and non utf-8 characters.
Here's what I've tried:
gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
However, This removes the special characters (punctuations + non utf8) but the output has no spaces.
gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
The result has spaces but there are still non utf8 characters present.
Any work around?
For the sample string above, output should be: Sample string here
The approach is to use the String. replaceAll method to replace all the non-alphanumeric characters with an empty string.
Alphanumeric characters by definition only comprise the letters A to Z and the digits 0 to 9. Spaces and underscores are usually considered punctuation characters, so no, they shouldn't be allowed.
Alphanumeric, also referred to as alphameric, is a term that encompasses all of the letters and numerals in a given language set. In layouts designed for English language users, alphanumeric characters are those comprised of the combined set of the 26 alphabetic characters, A to Z, and the 10 Arabic numerals, 0 to 9.
Python String isalnum() Method The isalnum() method returns True if all the characters are alphanumeric, meaning alphabet letter (a-z) and numbers (0-9). Example of characters that are not alphanumeric: (space)!
You could use the classes [:alnum:]
and [:space:]
for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases illustrate clearly that the characters still there, are considered letters. Also the EBHP inside are still letters, so the condition on which you're replacing is not correct. You don't want to keep all letters, you just want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even not bound by spaces, use the word boundary \\b
instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
.*?
fits anything (.) at least 0 times (*) but ungreedy (?). This means that gsub will try to fit the smallest amount possible by this piece.()
will be stored and can be refered to in the replacement by \\1
\\b
indicates a word boundaryA-z
will include all special letters (which are UTF-8 btw!)\\1
in combination with .*
in the regex, will make sure only the required part remains in the output.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With