In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src) to convert it into a byte array, I notice that the whitespace is actually two bytes with values 194 and 160.
For example, the string "CLE Action" looks like
[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]
in a byte array, where the whitespace is 194 and 160. Because of this, src.IndexOf("CLE Action") is returning -1 when I need it to return 1.
How can I fix the encoding of the string?
UTF-8 is defined by the Unicode Standard; the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit. UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
UTF-16 is more compact than UTF-8 only for some text, for example on some non-English websites. For characters from U+0800 to U+FFFF (which covers most East Asian scripts), UTF-8 needs three bytes per character, whereas UTF-16 needs only two. Characters above U+FFFF take four bytes in both encodings.
194 160 (0xC2 0xA0) is the UTF-8 encoding of the NO-BREAK SPACE code point, U+00A0 (the same code point that HTML calls &nbsp;).
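To see this for yourself, here is a minimal sketch that prints the UTF-8 bytes of a string as decimal values. The class and method names (NbspBytesDemo, Utf8Bytes) are made up for illustration; only Encoding.UTF8.GetBytes and string.Join are real framework calls.

```csharp
using System.Text;

public static class NbspBytesDemo
{
    // Hypothetical helper: returns the UTF-8 bytes of s as a
    // comma-separated list of decimal values.
    public static string Utf8Bytes(string s)
    {
        return string.Join(", ", Encoding.UTF8.GetBytes(s));
    }
}

// Utf8Bytes("\u00A0") gives "194, 160" (NO-BREAK SPACE, two bytes).
// Utf8Bytes(" ")      gives "32"       (ordinary space, one byte).
```

The string itself is fine; it simply contains a different character than the one you are searching for.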
So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
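The difference between the two checks can be sketched like this. The class and method names are hypothetical, and the sketch assumes the default .NET regex options (with RegexOptions.ECMAScript, \s is restricted to ASCII whitespace and would not match).

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCheck
{
    // Plain comparison: true only for the ordinary space, U+0020.
    public static bool IsPlainSpace(char c)
    {
        return c == ' ';
    }

    // Regex \s: with default options this covers Unicode separators,
    // which includes NO-BREAK SPACE (U+00A0).
    public static bool MatchesRegexWhitespace(char c)
    {
        return Regex.IsMatch(c.ToString(), @"\s");
    }

    // char.IsWhiteSpace also treats U+00A0 as white space.
    public static bool IsUnicodeWhiteSpace(char c)
    {
        return char.IsWhiteSpace(c);
    }
}
```

For '\u00A0', IsPlainSpace returns false while the other two return true, which is why IndexOf with a literal space finds nothing.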
To simply replace NO-BREAK spaces you can do the following:
src = src.Replace('\u00A0', ' ');
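Putting that together with the search from the question, a small sketch might look like the following. NbspSearch, NormalizeSpaces and IndexOfIgnoringNbsp are made-up names, and the choice of StringComparison.Ordinal is an assumption (it makes the search case-sensitive and culture-independent).

```csharp
using System;

public static class NbspSearch
{
    // Replaces every NO-BREAK SPACE (U+00A0) with an ordinary space.
    // Strings are immutable, so this returns a new string.
    public static string NormalizeSpaces(string src)
    {
        return src.Replace('\u00A0', ' ');
    }

    // Searches the normalized text. The replacement is one char for one
    // char, so the returned index is also valid in the original string.
    public static int IndexOfIgnoringNbsp(string src, string value)
    {
        return NormalizeSpaces(src).IndexOf(value, StringComparison.Ordinal);
    }
}
```

With this, IndexOfIgnoringNbsp("CLE\u00A0Action", "CLE Action") returns 0 instead of -1. If the casing of the extracted text may differ from your search term, StringComparison.OrdinalIgnoreCase is an option.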