 

Setting UTF-8 in Java and a CSV file [duplicate]

I am using this code to add Persian words to a CSV file via OpenCSV:

    String[] entries = "\u0645 \u062E\u062F\u0627".split("#");
    try {
        CSVWriter writer = new CSVWriter(
                new OutputStreamWriter(new FileOutputStream("C:\\test.csv"), "UTF-8"));
        writer.writeNext(entries);
        writer.close();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }

When I open the resulting CSV file in Excel, it shows "ứỶờịỆ". Other programs such as notepad.exe don't have this problem, but all of my users use MS Excel.

Replacing OpenCSV with SuperCSV does not solve this problem.

When I type Persian characters into a CSV file manually, I don't have any problems.

mehdi asked Nov 16 '10 08:11




2 Answers

I spent some time on this and found a solution to your problem.

First, I opened Notepad and wrote the following line: "שלום, hello, привет". Then I saved it as he-en-ru.csv using UTF-8. When I opened it with MS Excel, everything worked well.

Next, I wrote a simple Java program that prints this line to a file as follows:

    PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(line);
    w.flush();
    w.close();

When I opened this file in Excel, I saw gibberish.

Then I compared the raw content of the two files and (as expected) saw that the file generated by Notepad starts with a 3-byte prefix, shown here in decimal and hex:

    239 EF
    187 BB
    191 BF
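The comparison described above can be reproduced with a few lines of Java. This is only a minimal sketch: the class name `BomCheck` and the method `startsWithUtf8Bom` are hypothetical names chosen for illustration, not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration: checks whether a file begins with EF BB BF.
class BomCheck {
    // The three prefix bytes listed above: 239 (EF), 187 (BB), 191 (BF).
    static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns true if the file starts with the three-byte UTF-8 prefix. */
    static boolean startsWithUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            for (int expected : UTF8_BOM) {
                // read() returns 0..255 for a byte, or -1 at end of file.
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BomCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + " starts with UTF-8 BOM: " + startsWithUtf8Bom(file));
    }
}
```

Running it against the Notepad file and the Java-generated file should show the difference between the two.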

So, I modified my code to print this prefix first and the text after that:

    String line = "שלום, hello, привет";
    OutputStream os = new FileOutputStream("c:/temp/j.csv");
    // Write the 3-byte prefix (EF BB BF) before any content.
    os.write(239);
    os.write(187);
    os.write(191);

    PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(line);
    w.flush();
    w.close();

And it worked! I opened the file in Excel and saw the text as I expected.

Bottom line: write these 3 bytes before writing the content. This prefix is the UTF-8 byte order mark (BOM); it indicates that the content is in 'UTF-8 with BOM' (otherwise it is just 'UTF-8 without BOM').
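An equivalent way to emit the same prefix is to write the character U+FEFF through the UTF-8 writer itself, since U+FEFF encodes to EF BB BF in UTF-8. The following is a minimal sketch under that assumption; the class name `Utf8BomWriter` and its `open` method are hypothetical, and the temp-file name is only for the demo.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration: a UTF-8 writer that starts with a BOM.
class Utf8BomWriter {
    /** Opens a UTF-8 writer on the file and emits the BOM before any content. */
    static Writer open(Path file) throws IOException {
        Writer writer = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8);
        try {
            // U+FEFF is encoded as the bytes EF BB BF in UTF-8.
            writer.write('\uFEFF');
        } catch (IOException e) {
            writer.close(); // do not leak the stream if the first write fails
            throw e;
        }
        return writer;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("persian", ".csv");
        try (Writer writer = open(file)) {
            // Same Persian sample text as in the question, plus an ASCII column.
            writer.write("\u0645 \u062E\u062F\u0627,hello\n");
        }
        System.out.println("Wrote " + file);
    }
}
```

Because the question's code already passes an `OutputStreamWriter` to `CSVWriter`, a writer opened this way could presumably be handed to `CSVWriter` in the same position, so the BOM precedes the CSV rows.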

AlexR answered Sep 30 '22 14:09



Unfortunately, CSV is a very ad hoc format with no metadata and no real standard that would mandate a flexible encoding. As long as you use CSV, you can't reliably use any characters outside of ASCII.

Your alternatives:

  • Write to XML (which does have encoding metadata if you do it right) and have the users import the XML into Excel.
  • Use Apache POI to create actual Excel documents.
Michael Borgwardt answered Sep 30 '22 16:09
