
Google App Engine DataStore Text UTF-8 Encoding Problem

I'm building a GWT app that stores the text of random web pages in a datastore Text field. The text is often UTF-8 encoded. All of my app's files are stored as UTF-8, and when I run the application on my local machine the entire process works fine: UTF-8 text is stored as such and can be retrieved from the local version of App Engine as UTF-8. However, when I deploy the app to Google App Engine, somewhere between storing the text and retrieving it, the text stops being UTF-8, and non-ASCII characters are displayed as ?.

When I view the datastore in the App Engine control panel, all the special characters appear as ?, which leads me to believe that the problem occurs when writing to the datastore.

Does anyone know how to fix this?

The app itself is a little big. Here's some pseudocode:

Text webPageText = new Text(<STRING THAT CONTAINS UNICODE CHARACTERS>);

/*Some Code to store Text object on datastore
Specifically I'm using javax.jdo.PersistenceManager to do this.
Some Code to retrieve text from datastore. */

String retrievedText = webPageText.getValue();

The problem is that retrievedText comes back with ? in place of the Unicode characters.

Here's a similar problem in Python that I found: Trying to store Utf-8 data in datastore getting UnicodeEncodeError. My app, however, is not getting any errors.

Unfortunately, I think Java strings are UTF-8 by default, and I can't find any code that lets me declare them explicitly as UTF-8.

Edit: I've now built a small web app that takes in Unicode text, stores it in the datastore, and retrieves it with no problems. I still have no idea where the problem is in my original source code, but I'm going to change the way my code handles web page retrieval to match the smaller app that I just built. Thank you everyone for your help.

Richard Wallis asked Jun 29 '10 22:06



2 Answers

I fixed the same issue by setting both the request and the response encoding to UTF-8. Setting the request encoding results in a valid string being stored in the datastore; without it, values are stored as "????...".

Requests: if you use Apache HTTP Client, this is done in the following way:

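GET request:

// params holds the query parameters; urlBase is assumed to already end with "?"
NameValuePair[] params;
...
// Percent-encode the parameters as UTF-8 and append them to the base URL
String url = urlBase + URLEncodedUtils.format(Arrays.asList(params), "UTF-8");
HttpGet httpGet = new HttpGet(url);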

POST request:

// params holds the form fields to send
NameValuePair[] params;
...
HttpPost httpPost = new HttpPost(url);
// Encode the form body as UTF-8
httpPost.setEntity(new UrlEncodedFormEntity(Arrays.asList(params), "UTF-8"));
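
Both snippets come down to percent-encoding the parameter values as UTF-8 bytes. Here is a minimal JDK-only sketch of that idea, not the Apache HttpClient code itself. The class QueryEncodingSketch and its encode/decode helpers are hypothetical names, and the sample value is an assumption chosen for illustration. It shows the same value surviving a round trip when both sides use UTF-8, and coming back mangled when the decoding side assumes ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryEncodingSketch {
    // Percent-encode one parameter value using the named charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decode a percent-encoded value using the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent, written as an escape so the
        // source file's own encoding does not affect the result.
        String value = "caf\u00e9";

        String encoded = encode(value, "UTF-8");
        System.out.println(encoded); // caf%C3%A9

        // Same charset on both sides: the value survives.
        System.out.println(decode(encoded, "UTF-8").equals(value)); // true

        // Mismatched charset on the receiving side: the value is mangled.
        System.out.println(decode(encoded, "ISO-8859-1").equals(value)); // false
    }
}
```

The point of the sketch is only that sender and receiver have to agree on the charset; the HttpClient calls above take care of the sending half.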

Response: if you build your response in an HttpServlet, this is done in the following way:

HttpServletResponse resp;
...
resp.setContentType("text/html; charset=utf-8");
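
Why the charset on the response matters can be sketched without a servlet container. If no character encoding is set, a servlet response writer falls back to ISO-8859-1, and characters that charset cannot represent are replaced with ?. The sketch below imitates that with a plain OutputStreamWriter; ResponseCharsetSketch and writeBody are hypothetical names, not part of the servlet API, and the sample text is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseCharsetSketch {
    // Turn a response body into bytes the way a character writer does:
    // characters go in, bytes in the given charset come out.
    static byte[] writeBody(String body, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two CJK characters that ISO-8859-1 cannot represent.
        String body = "\u4f60\u597d";

        byte[] latin1 = writeBody(body, StandardCharsets.ISO_8859_1);
        byte[] utf8 = writeBody(body, StandardCharsets.UTF_8);

        // Each unmappable character has been replaced by a literal '?'.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ??

        // With UTF-8 the original text can be recovered from the bytes.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(body)); // true
    }
}
```

Once a character has been written as ?, the original is gone, so the charset has to be set before the writer is obtained.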
Better Living Guy answered Sep 25 '22 16:09



I tried converting the String to a byte array and storing it as a datastore Blob.

//Save String as Blob
Blob webPageText = new Blob(<STRING THAT CONTAINS UNICODE CHARACTERS>.getBytes());

//Retrieve Blob as String
String retrievedText = new String(webPageText.getBytes());

I originally thought this had solved the problem, but I had mistakenly tested it only on my local server. On the deployed app, this code still returns ? instead of the Unicode characters, which leads me to believe that the problem isn't in the datastore but in the transfer from App Engine to the client.
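
One thing worth checking in the snippet above: getBytes() and new String(byte[]) without a charset argument use the JVM's default charset, which may not be the same locally and on the deployed server. Below is a minimal sketch, assuming plain JDK classes only (no datastore Blob), that pins both directions to UTF-8 and dumps the bytes as hex so a literal ? (3F) can be told apart from correctly encoded data. ExplicitCharsetSketch and its helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetSketch {
    // Encode with an explicit charset instead of the platform default.
    static byte[] toUtf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String fromUtf8Bytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render bytes as uppercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with an acute accent

        // Explicit UTF-8: two bytes, and the round trip is lossless.
        System.out.println(hex(toUtf8Bytes(text))); // C3A9
        System.out.println(fromUtf8Bytes(toUtf8Bytes(text)).equals(text)); // true

        // What happens if the charset in effect is US-ASCII:
        // the character is replaced by '?' (0x3F) at encoding time.
        System.out.println(hex(text.getBytes(StandardCharsets.US_ASCII))); // 3F
    }
}
```

If the stored bytes already contain 3F where a non-ASCII character should be, the loss happened before storage; if they are intact, the problem is further along, for example in how the response is written.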

Richard Wallis answered Sep 24 '22 16:09
