I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ', and so on.
Is there an optimal solution using regex? My current regex is (as "Remove characters using Regex" suggests):
Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();
Just to emphasize, I'm only concerned with Latin characters.
To remove a character from an R data frame column, we can use the gsub function, which replaces the character with an empty string. For example, if we have a data frame called df with a character column x in which every value contains the characters ID, they can be removed by using the command gsub("ID","",as.
If you have a string with special characters and want to remove or replace them, you can use a regex for that. Use this code: Regex.Replace(yourString, @"[^0-9a-zA-Z]+", "")
To check that a string contains only letters (uppercase or lowercase), we use the regular expression /^[A-Za-z]+$/, which matches only letters.
I think the best regex would be to use:
[^\x00-\x80]
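This is the negation of the ASCII range, so it matches the non-ASCII characters. \x00 and \x80 (128) are hexadecimal character codes, - means range, and the ^ inside the [ and ] means negation. Strictly speaking, ASCII ends at \x7F (127), so [^\x00-\x7F] is the exact form; the only difference is that the pattern above also lets the control character \x80 through.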
Replace them with the empty string, and you should have what you want. It also frees you from worrying about punctuation, and the like, that are not ASCII, and can cause subtle but annoying (and hard to track down) errors.
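As a rough sketch of how this could be combined with the regex from the question: the class name ResourceKey and the method FromName are made up for illustration, and ToUpperInvariant is used instead of ToUpper on the assumption that resource keys should not depend on the current culture.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ResourceKey
{
    // The pattern from the answer: anything outside \x00-\x80.
    private static readonly Regex NonAscii = new Regex(@"[^\x00-\x80]");

    // The pattern from the question: non-word characters and underscores.
    private static readonly Regex NonWord = new Regex(@"[\W_]+");

    public static string FromName(string messyText)
    {
        // First drop the non-ASCII characters (ç, ã, õ, ...),
        // then drop punctuation, whitespace and underscores.
        string asciiOnly = NonAscii.Replace(messyText, "");
        return NonWord.Replace(asciiOnly, "").ToUpperInvariant();
    }
}
```

With this, ResourceKey.FromName("lição") gives "LIO" and ResourceKey.FromName("maça") gives "MAA". Note that the accented letters are dropped, not converted to their unaccented forms.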
If you want to use the extended ASCII set as legal characters, you can say \xFF instead of \x80.
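A small sketch of the \xFF variant, with the hypothetical names Latin1Filter and Strip. One thing to keep in mind: .NET strings are UTF-16, so \x00-\xFF covers the code points U+0000 to U+00FF, and that range includes ç, ã and õ. This variant therefore keeps the Portuguese letters and only removes characters beyond that range.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class Latin1Filter
{
    // Anything outside \x00-\xFF (code points above U+00FF).
    private static readonly Regex OutsideLatin1 = new Regex(@"[^\x00-\xFF]");

    public static string Strip(string text)
    {
        return OutsideLatin1.Replace(text, "");
    }
}
```

For example, Latin1Filter.Strip("lição €5") returns "lição 5": the euro sign (U+20AC) is removed, while ç and ã stay.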