i have a VB.NET program that handles the content of documents. The programm handles high volumes of documents as "batch"(>2Million documents;total 1TB volume) Some of this documents may contain control chars or chars like f0e8(http://www.fileformat.info/info/unicode/char/f0e8/browsertest.htm).
Is there a easy and especially fast way to remove that chars?(except space,newline,tab,...) If the answer is regex: Has anyone a complete regex for me?
Thanks!
Try
resultString = Regex.Replace(subjectString, "\p{C}+", "");
This will remove all "other" Unicode characters (control, format, private use, surrogate, and unassigned) from your string.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With