
MOXy JAXB marshals invalid control characters for Unicode (u+2019) when UTF-8 encoding is specified

I have encountered a very annoying error when trying to marshal a class to JSON using Eclipse Moxy.

I have an attribute with the following value in one of my domain classes: "the City’s original city site", which contains the code point U+2019 (’).

When JAXB attempts to marshal this value, I inexplicably get back a strange control character: "Citys original city site"

This results in invalid JSON that returns a null value when decoded. I tried this with Jackson and received an ASCII escape sequence, which is still wrong, but at least it makes for valid JSON!

MOXy should be able to output this correctly, as ’ is a valid Unicode character and is valid within JSON. Is there anything I can do to output the ’ (and any other Unicode character) correctly, preferably converting this needless character to a regular apostrophe?

Here is my provider class:

@Provider
@Component("customMOXyJsonProvider")    
public class CustomMOXyJsonProvider extends MOXyJsonProvider {

    @Override
    protected void preWriteTo(Object object, Class<?> type, Type genericType,
                              Annotation[] annotations, MediaType mediaType,
                              MultivaluedMap<String, Object> httpHeaders, Marshaller marshaller)
            throws JAXBException {
        marshaller.setProperty(MarshallerProperties.JSON_INCLUDE_ROOT, true);
        marshaller.setProperty(Marshaller.JAXB_ENCODING,"UTF-8");
    }

}

I am using version 2.5.1 of Moxy.

    <dependency>
        <groupId>org.eclipse.persistence</groupId>
        <artifactId>org.eclipse.persistence.moxy</artifactId>
        <version>2.5.1</version>
    </dependency>

I have several components in my system that could theoretically screw up the value (Postgres, JDBC, Hibernate, CXF and Tomcat), but I have determined through testing that the value is stored correctly in my domain class and then corrupted, like Elliot Spitzer visiting a prostitute, at the marshaling step.

THX1138 asked Oct 08 '13 20:10

1 Answer

Note: I'm the EclipseLink JAXB (MOXy) lead and a member of the JAXB (JSR-222) expert group.

UPDATE #3

The issue has now been fixed in the EclipseLink 2.5.2 and 2.6.0 streams. You will be able to download a nightly build starting October 10, 2013 from the following location:

  • http://www.eclipse.org/eclipselink/downloads/nightly.php

Or from Maven with

<dependency>
    <groupId>org.eclipse.persistence</groupId>
    <artifactId>org.eclipse.persistence.moxy</artifactId>
    <version>2.5.2-SNAPSHOT</version>
</dependency>

and

<repository>
    <id>oss.sonatype.org</id>
    <name>OSS Sonatype Staging</name>
    <url>https://oss.sonatype.org/content/groups/staging</url>
</repository>

UPDATE #2

The following bug can be used to track our progress on this issue:

  • http://bugs.eclipse.org/419072

UPDATE #1

Your use case works in EclipseLink 2.5.0. A performance fix we made in EclipseLink 2.5.1 introduced the failure:

  • http://bugs.eclipse.org/404449

ORIGINAL ANSWER

There appears to be a bug in our marshalling to OutputStream that doesn't exist in our marshalling to Writer for JSON (XML works correctly). Below is what my quick investigation has uncovered. I will update my answer once I have more information.
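As background on why the two paths can differ: an OutputStream carries bytes, so whoever writes to it must encode each character, whereas a Writer accepts characters and handles the encoding itself. The sketch below uses only the JDK and is not MOXy code. It compares the correct UTF-8 encoding of U+2019 with a hypothetical faulty path that keeps only the low 8 bits of each char. That faulty path is an assumption made for illustration, not a confirmed description of the bug; it would turn ’ into the control character 0x19, which is consistent with a stray control character appearing in the JSON. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class ApostropheBytes {

    private ApostropheBytes() {}

    // Correct path: encode the text as UTF-8 before it reaches a byte stream.
    public static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical faulty path (an assumption, not MOXy's actual code):
    // each char is narrowed to a single byte, which drops its high 8 bits.
    public static byte[] truncatedBytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) text.charAt(i);
        }
        return out;
    }

    // Render bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String apostrophe = "\u2019";
        System.out.println(hex(utf8Bytes(apostrophe)));      // E2 80 99
        System.out.println(hex(truncatedBytes(apostrophe))); // 19
    }
}
```

ASCII-only text survives both paths unchanged, which would explain why a problem of this kind only shows up for characters above U+007F.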

Java Model

public class Foo {

    private String bar;

    public String getBar() {
        return bar;
    }

    public void setBar(String bar) {
        this.bar = bar;
    }

}

Demo Code

import java.io.OutputStreamWriter;
import java.util.*;
import javax.xml.bind.*;
import javax.xml.bind.Marshaller;
import org.eclipse.persistence.jaxb.JAXBContextProperties;

public class Demo {

    public static void main(String[] args) throws Exception {
        Map<String, Object> properties = new HashMap<String, Object>();
        properties.put(JAXBContextProperties.MEDIA_TYPE, "application/json");
        properties.put(JAXBContextProperties.JSON_INCLUDE_ROOT, false);
        JAXBContext jc = JAXBContext.newInstance(new Class[] {Foo.class}, properties);

        Foo foo = new Foo();
        foo.setBar("the City’s original city site");


        Marshaller marshaller = jc.createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);

        // Broken
        marshaller.marshal(foo, System.out);

        // Works
        marshaller.marshal(foo, new OutputStreamWriter(System.out));
    }

}

Output

{
   "bar" : "the Citys original city site"
}{
   "bar" : "the City’s original city site"
}
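The question also asks about converting ’ to a regular apostrophe. That is independent of the MOXy fix and can be done on the string before it is marshalled. Below is a minimal sketch using only the JDK; QuoteNormalizer and normalizeApostrophes are hypothetical names, and the choice to also map the left single quote (U+2018) is an assumption about what would be wanted.

```java
final class QuoteNormalizer {

    private QuoteNormalizer() {}

    // Replace typographic single quotes (U+2018, U+2019) with the ASCII apostrophe.
    // Returns null for null input so it can be called safely from a setter.
    public static String normalizeApostrophes(String text) {
        if (text == null) {
            return null;
        }
        return text.replace('\u2018', '\'').replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        // prints: the City's original city site
        System.out.println(normalizeApostrophes("the City\u2019s original city site"));
    }
}
```

A helper like this could be called from a setter such as Foo.setBar, or from a JAXB XmlAdapter if the stored value has to stay untouched.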
bdoughan answered Nov 15 '22 04:11