
remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java

I have to handle this scenario in Java:

I'm getting a request in XML form from a client with declared encoding=utf-8. Unfortunately it may contain non-UTF-8 characters, and there is a requirement (legacy) to remove these characters from the XML on my side.

Let's consider an example where this invalid XML contains £ (pound).

1) I get xml as java String with £ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(£, "") to get rid of this character? Any potential issues?

2) I get xml as an array of bytes - how to handle this operation safely in that case?

St Nietzke asked May 19 '10 20:05



2 Answers

1) I get xml as java String with £ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(£, "") to get rid of this character?

I assume you actually want to get rid of non-ASCII characters, since you're talking about a "legacy" side. You can remove anything outside the printable ASCII range using the following regex:

string = string.replaceAll("[^\\x20-\\x7e]", ""); 
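As a minimal sketch of what that regex does (the class name AsciiFilter and the sample XML are made up for illustration), note that the range \x20-\x7e also excludes tabs, carriage returns and newlines, so those get removed too:

```java
class AsciiFilter {
    // Removes every character outside the printable ASCII range 0x20-0x7E.
    // Tabs, carriage returns and newlines fall outside that range and are dropped as well.
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the escape below is the pound sign (U+00A3).
        String xml = "<price>\u00a310</price>\n";
        System.out.println(stripNonPrintableAscii(xml)); // prints <price>10</price>
    }
}
```

If the line breaks matter to you, something like "[^\\x20-\\x7e\\r\\n\\t]" would keep them.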

2) I get xml as an array of bytes - how to handle this operation safely in that case?

You need to wrap the byte[] in a ByteArrayInputStream so that you can read it as a UTF-8 encoded character stream using an InputStreamReader, in which you specify the encoding, and then use a BufferedReader to read it line by line.

E.g.

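    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            // Strip everything outside the printable ASCII range.
            line = line.replaceAll("[^\\x20-\\x7e]", "");
            // ...
        }
        // ...
    } finally {
        // Close the reader even if reading fails.
        if (reader != null) reader.close();
    }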
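A self-contained variant of the snippet above, as a sketch only (the class and method names are invented, and re-joining the lines with '\n' is my assumption about what you want to do with them). It relies on InputStreamReader replacing malformed byte sequences with U+FFFD, which the regex then strips along with every other non-ASCII character:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteArrayXmlCleaner {
    // Decodes the bytes as UTF-8 and keeps only printable ASCII on each line.
    static String clean(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            for (String line; (line = reader.readLine()) != null;) {
                // Malformed input has already been decoded to U+FFFD, which is removed here.
                out.append(line.replaceAll("[^\\x20-\\x7e]", "")).append('\n');
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0xA3 is the pound sign in ISO-8859-1 but an invalid lone byte in UTF-8.
        byte[] bytes = {'<', 'a', '>', (byte) 0xA3, '1', '<', '/', 'a', '>'};
        System.out.print(clean(bytes)); // prints <a>1</a>
    }
}
```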
BalusC answered Sep 19 '22 13:09


UTF-8 is an encoding; Unicode is a character set. But the GBP symbol is most definitely in the Unicode character set and therefore most certainly representable in UTF-8.
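To make that concrete, here is a small sketch (the class and helper names are hypothetical) that prints the bytes of the pound sign in two encodings. In UTF-8 it is the two-byte sequence C2 A3; a lone A3 byte, which is what ISO-8859-1 produces, is not valid UTF-8, and that may be what the client is actually sending:

```java
import java.nio.charset.StandardCharsets;

class PoundEncoding {
    // Formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00a3"; // the pound sign, U+00A3
        System.out.println(toHex(pound.getBytes(StandardCharsets.UTF_8)));      // prints C2 A3
        System.out.println(toHex(pound.getBytes(StandardCharsets.ISO_8859_1))); // prints A3
    }
}
```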

If you do in fact mean UTF-8, and you are actually trying to remove byte sequences that are not the valid encoding of a character in UTF-8, then...

    CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
    utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
    utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
    ByteBuffer bytes = ...;
    CharBuffer parsed = utf8Decoder.decode(bytes);
    ...
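A runnable sketch of the same idea, under the assumption that the input arrives as a byte[] (the class name, method name and sample bytes are made up for illustration). Malformed sequences are silently dropped, while a correctly encoded pound sign survives:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sanitizer {
    // Decodes raw bytes as UTF-8, skipping any byte sequence that is not valid UTF-8.
    static String dropMalformedUtf8(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Should not happen with IGNORE set for both error kinds.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] invalid = {'a', (byte) 0xA3, 'b'};       // lone 0xA3 is not valid UTF-8
        byte[] valid = {(byte) 0xC2, (byte) 0xA3, '5'}; // C2 A3 is the pound sign in UTF-8
        System.out.println(dropMalformedUtf8(invalid));       // prints ab
        System.out.println(dropMalformedUtf8(valid).length()); // prints 2
    }
}
```

Calling getBytes(StandardCharsets.UTF_8) on the resulting String gives back a byte array that is guaranteed to be valid UTF-8.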
Sean Owen answered Sep 17 '22 13:09