How to parse UTF-8 characters in Excel files using POI

I have been using POI to parse XLS and XLSX files successfully. However, I am unable to correctly extract non-ASCII text, such as Chinese or Japanese characters, from an Excel spreadsheet. I have figured out how to extract data from a UTF-8 encoded CSV or tab-delimited file, but have had no luck with the Excel file. Can anyone help?

(Edit: Code snippet from comments)

HSSFSheet sheet = workbook.getSheet(worksheet);
HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook);
while (rowCtr <= lastRow && !rowBreakOut)
{
    Row row = sheet.getRow(rowCtr); // rows.next();
    for (int col = firstCell; col < lastCell && !breakOut; col++) {
      // Blank cells come back as null under RETURN_BLANK_AS_NULL, so skip them.
      Cell cell = row.getCell(col, Row.RETURN_BLANK_AS_NULL);
      if (cell == null) continue;
      int ctype = cell.getCellType();
      if (ctype == Cell.CELL_TYPE_STRING) {
         sValue = cell.getStringCellValue();
         log.warn("String value = " + sValue);
         // URL-encoding exposes the UTF-8 bytes of the string in the log.
         String encoded = URLEncoder.encode(sValue, "UTF-8");
         log.warn("URL-encoded with UTF-8: " + encoded);
         ....

asked Feb 08 '12 22:02

user1198370

4 Answers

I had the same problem while extracting Persian text from an Excel file. I was using Eclipse, and simply going to Project -> Properties and changing the "text file encoding" to UTF-8 solved the problem.
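
If the strings are read correctly but still show up as question marks or garbage in the console or log, the output encoding is a likely culprit, which is what the Eclipse setting changes. Here is a minimal sketch of forcing UTF-8 output from code instead. The class and method names (Utf8ConsoleDemo, utf8Stream) are made up for illustration, and the hard-coded string only stands in for a value read from a cell:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8ConsoleDemo {

    // Wraps an output stream in a PrintStream that always encodes text as UTF-8,
    // regardless of the JVM's default charset.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for cell.getStringCellValue(); escapes keep the source file ASCII-only.
        String sValue = "\u4e2d\u6587";
        PrintStream out = utf8Stream(System.out);
        out.println("String value = " + sValue);
    }
}
```

Whether the characters then render properly still depends on the terminal or console interpreting the bytes as UTF-8. On many setups, starting the JVM with -Dfile.encoding=UTF-8 is another way to get a similar effect.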

answered Oct 05 '22 06:10

Roozbehan

In POI, you can set the font's character set when writing a cell, like this:

Workbook wb = new HSSFWorkbook();
Sheet sheet = wb.createSheet("new sheet");

// Create a row and put some cells in it. Rows are 0 based.
Row row = sheet.createRow(1);

// Create a new font and alter it.
Font font = wb.createFont();
font.setCharSet(FontCharset.ARABIC.getValue());
font.setFontHeightInPoints((short)24);
font.setFontName("B Nazanin");
font.setItalic(true);
font.setStrikeout(true);

// Fonts are set into a style so create a new one to use.
CellStyle style = wb.createCellStyle();
style.setFont(font);

// Create a cell and put a value in it.
Cell cell = row.createCell(1);
cell.setCellValue("سلام");
cell.setCellStyle(style);

// Write the output to a file; try-with-resources closes the stream even if write() fails.
try (FileOutputStream fileOut = new FileOutputStream("workbook.xls")) {
    wb.write(fileOut);
}

You can use any of the other charsets defined in FontCharset in the same way.

answered Oct 05 '22 08:10

oveis beheshti

Get the bytes of the cell value encoded as UTF-8, as follows:

cell.getStringCellValue().getBytes(Charset.forName("UTF-8"));
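
As a rough illustration of what that call produces, here is a small round trip between a Java string and its UTF-8 bytes. This is only a sketch: Utf8BytesDemo, toUtf8 and fromUtf8 are hypothetical names, and the literal stands in for the value returned by cell.getStringCellValue().

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytesDemo {

    // Encodes a string as UTF-8 bytes, e.g. before writing it to a file or socket.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a Java string.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source stays ASCII-only.
        String sValue = "\u65e5\u672c\u8a9e";
        byte[] utf8 = toUtf8(sValue);
        System.out.println("chars = " + sValue.length() + ", UTF-8 bytes = " + utf8.length);
        System.out.println("round trip intact: " + sValue.equals(fromUtf8(utf8)));
    }
}
```

The bytes are only useful if whatever consumes them also treats them as UTF-8. StandardCharsets.UTF_8 is equivalent to Charset.forName("UTF-8") and avoids the name lookup.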

answered Oct 05 '22 07:10

yottabrain

The solution is simple. To read cell string values containing non-English characters, use the following method:

sValue = cell.getRichStringCellValue().getString();

instead of:

sValue = cell.getStringCellValue();

This applies to UTF-8 encoded characters like Chinese, Arabic or Japanese.

P.S. If anybody is using the command-line utility nullpunkt/excel-to-json, which uses the Apache POI library, modify the file converter/ExcelToJsonConverter.java by replacing the occurrences of "getStringCellValue()" with "getRichStringCellValue().getString()" to avoid reading non-English characters as "???".
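
For anyone wondering where the "???" comes from in general: one common cause is that the text passes through a charset that cannot represent the characters somewhere along the way, at which point each character is silently replaced by "?". The sketch below reproduces that effect with plain Java and no POI involved. QuestionMarkDemo and roundTrip are hypothetical names, and whether this is the exact cause in a given setup is an assumption worth checking.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {

    // Encodes the text with the given charset and decodes it again with the same one.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The Arabic-script word from the answer above, written as escapes.
        String salam = "\u0633\u0644\u0627\u0645";

        // ISO-8859-1 has no Arabic letters, so every character becomes '?'.
        System.out.println(roundTrip(salam, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(salam.equals(roundTrip(salam, StandardCharsets.UTF_8)));
    }
}
```

Once a character has been replaced by "?", the original cannot be recovered, so the fix has to happen before the lossy step.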

answered Oct 05 '22 08:10

Yacoub Oweis

Tags: java, excel, utf-8, cjk, apache-poi