Are XLSX files UTF-8 encoded by definition?

I'm trying to read XLSX files with PHP, using gneustaetter/XLSXReader to be exact. These files are generated by different companies using different software, so I wanted to check whether they all have the right encoding. Every file I checked was UTF-8.

Hence my question: are XLSX files UTF-8 encoded by definition, or are there exceptions that could break the import script I'm working on?

Asked by Marco, Jul 19 '17 15:07



1 Answer

It'd be risky to presume it's always UTF-8. An XLSX file is a ZIP archive of XML parts, so I'd key your expectations to what each part states in its XML declaration (the encoding attribute on the first line). In my experience Windows-1252 encoded data shows up all the time when you least expect it. You might check the XLSX specification more closely to find out more.
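As a rough illustration of checking the declaration per part, here is a minimal sketch in Python (used only for illustration; the question is about PHP). The function name declared_encodings and the choice to look only at .xml and .rels entries are assumptions made for this example, not part of XLSXReader or of any specification.

```python
import re
import zipfile

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encodings(xlsx_source):
    """Map each XML part of an XLSX package to its declared encoding.

    xlsx_source is a path or a binary file-like object. The value is None
    when a part carries no encoding declaration (XML then defaults to UTF-8).
    """
    result = {}
    with zipfile.ZipFile(xlsx_source) as package:
        for name in package.namelist():
            # Assumption: only .xml and .rels entries are XML parts.
            if not name.endswith((".xml", ".rels")):
                continue
            with package.open(name) as part:
                head = part.read(200)  # the declaration sits at the very start
            if head.startswith((b"\xff\xfe", b"\xfe\xff")):
                # UTF-16 byte order mark: the declaration is not ASCII bytes,
                # so report the encoding family instead of parsing it.
                result[name] = "UTF-16"
                continue
            if head.startswith(b"\xef\xbb\xbf"):
                head = head[3:]  # skip a UTF-8 byte order mark
            match = _DECL_RE.match(head)
            result[name] = match.group(1).decode("ascii") if match else None
    return result
```

Calling declared_encodings("report.xlsx") (a hypothetical file name) returns a dictionary you can scan for anything other than UTF-8 before handing the file to the importer.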

There is a Chromium bug report about a Windows-1252 encoded XLSX file, so such files do exist in the wild. They may be produced by programs other than Microsoft Office. As alternatives like LibreOffice become more popular, files written by older versions with less robust XLSX support may end up reaching your code, and you probably don't want that to surface as a bug in your importer.

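Try to be as accommodating as possible unless you have a concrete reason to reject an unexpected encoding. JSON, by strict definition, is UTF-8. XLSX is XML by definition, and as far as I can tell the underlying Open Packaging Conventions (ECMA-376 Part 2) only permit UTF-8 or UTF-16 for the XML parts, but not every producer follows the specification. In practice UTF-8 is the default convention, not a guarantee.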
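One way to be accommodating is to decode each part defensively instead of assuming UTF-8. The sketch below is in Python (for illustration only) and tries a byte order mark first, then the declared encoding, then UTF-8, and finally a Windows-1252 fallback. The name decode_part, the order of attempts, and the choice of fallback are assumptions for this example; adjust them to the files you actually receive.

```python
import codecs
import re

# Encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

# Byte order marks and the Python codec that consumes each of them.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_part(raw, fallback="cp1252"):
    """Decode the raw bytes of one XML part; return (text, encoding_used)."""
    # 1. A byte order mark is the strongest hint.
    for bom, codec_name in _BOMS:
        if raw.startswith(bom):
            return raw.decode(codec_name), codec_name

    # 2. Try the declared encoding, then plain UTF-8.
    candidates = []
    match = _DECL_RE.match(raw[:200])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")
    for codec_name in candidates:
        try:
            return raw.decode(codec_name), codec_name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or mislabelled data

    # 3. Last resort (assumption: legacy Western European data).
    return raw.decode(fallback, errors="replace"), fallback
```

Returning the encoding that was actually used lets the import script log files that needed the fallback, which is usually a better signal than a silent guess.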

Answered by tadman, Oct 06 '22 19:10