I wrote a PHP script that generates CSV files that were previously generated by another process. The CSV files then have to be imported by yet another process. Importing the old CSV files works fine, but importing the new ones causes problems with special characters. When I open the old CSVs in Notepad++, it reports the encoding as UTF-8; when I open the new ones, it reports 'ANSI as UTF-8'. What is the difference between the two? And how can I make fopen and fputcsv write 'pure' UTF-8? Thanks!


What is "ANSI as UTF-8" and how can I make fputcsv() generate UTF-8 w/BOM?

2 Answers

There's nothing wrong with the file. "ANSI as UTF-8" means there's no BOM but Notepad++ has definitely identified the encoding as UTF-8 by analyzing byte patterns. I tested this by creating a file with Russian, Greek and Polish text in it and saving it as UTF-8 without a BOM. Here it is:

# Russian
Следующая

# Greek
Επόμενη

# Polish
Więcej

I did this in a different editor (EditPad Pro) and used hex mode to make sure the BOM wasn't there. When I opened it in NPP it showed the encoding as "ANSI as UTF-8" and all of the characters displayed correctly. Then, still in hex mode, I removed the first byte of the first Russian character. When I opened it in NPP again, it showed the encoding as "ANSI" and displayed the non-ASCII parts of the text as mojibake:

; Russian
¡Ð»ÐµÐ´ÑƒÑŽÑ‰Ð°Ñ

; Greek
Î•Ï€ÏŒÎ¼ÎµÎ½Î·

; Polish
WiÄ™cej

Back to EditPad, and this time I added a BOM but didn't repair the Cyrillic character. This time NPP reported the encoding as "UTF-8" and everything displayed correctly except that first Russian character, as shown below. "A1" is the hex representation of what should have been the second byte of that character in UTF-8. It was displayed in an inverted color scheme to indicate an error.

# Russian
A1ледующая

# Greek
Επόμενη

# Polish
Więcej

To summarize: In the absence of a BOM, Notepad++ looks for bytes that can't represent ASCII characters because their values are greater than 127 (or 7F hex). If it finds any, but they all conform to the patterns required by UTF-8, it decodes the file as UTF-8 and reports the encoding in the status bar as "ANSI as UTF-8".

But if it finds even one byte that doesn't toe the UTF-8 line, it decodes the file as "ANSI", meaning the default single-byte encoding for the underlying platform. If your file had been corrupted, that's what you would be seeing.
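The labelling described above can be approximated in a few lines of PHP. This is only a sketch of the observed behaviour, not Notepad++'s actual detection code; `describe_encoding` is a hypothetical name, and the sketch assumes the mbstring extension is available.

```php
<?php
// Rough approximation of the status-bar labels described above.
// NOT Notepad++'s real algorithm; describe_encoding() is a made-up helper.
function describe_encoding(string $bytes): string
{
    // A leading EF BB BF wins: the file is reported as UTF-8,
    // even if some later byte sequence is malformed.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    // No byte above 7F hex: pure ASCII, nothing to detect.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ANSI';
    }
    // High bytes present: they must all form valid UTF-8 sequences.
    return mb_check_encoding($bytes, 'UTF-8') ? 'ANSI as UTF-8' : 'ANSI';
}

echo describe_encoding("Wi\u{0119}cej"), "\n";        // valid UTF-8, no BOM
echo describe_encoding("\xEF\xBB\xBFWi\x99cej"), "\n"; // BOM, one broken character
echo describe_encoding("Wi\x99cej"), "\n";             // no BOM, one broken character
```

Running it on the contents of a generated CSV (for example via `file_get_contents`) gives a quick check that does not depend on an editor.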

EDIT: Although your file is valid without it, you could add a BOM by manually writing the three bytes "EF BB BF" at the very beginning of the file, but there should be a better way. How are you generating the content now? It must already be UTF-8 with at least one non-ASCII character in it somewhere; otherwise, NPP would report it as "ANSI".
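For the manual approach, a minimal sketch in PHP could look like the following. The function name `write_csv_with_bom`, the file name and the sample rows are assumptions for illustration; the only essential step is the `fwrite` of the three BOM bytes before the first `fputcsv` call.

```php
<?php
// Hypothetical helper: writes $rows as CSV, preceded by the UTF-8 BOM.
function write_csv_with_bom(string $path, array $rows): void
{
    $handle = fopen($path, 'wb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    // The BOM (EF BB BF) must be the very first bytes of the file.
    fwrite($handle, "\xEF\xBB\xBF");
    foreach ($rows as $row) {
        // fputcsv writes the field bytes as they are; it does not convert encodings.
        fputcsv($handle, $row, ',', '"', '\\');
    }
    fclose($handle);
}

// Example path and data, chosen only for this sketch.
$path = sys_get_temp_dir() . '/export_with_bom.csv';
write_csv_with_bom($path, [
    ['id', 'name'],
    ['1', "Wi\u{0119}cej"], // "Więcej" as UTF-8
]);
```

The field values themselves still have to be UTF-8 already, since neither `fopen` nor `fputcsv` changes the encoding of what is written.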

Another possibility to consider: if you have any influence over the process that consumes your CSV file, maybe you can configure it to expect UTF-8 without a BOM. Technically, any software that can decode UTF-8 with a BOM but not without one is broken. The Unicode Consortium actually discourages use of the UTF-8 BOM, not that anyone's listening.

answered Oct 11 '22 22:10

Alan Moore

According to the Notepad++-related threads here and here, 'ANSI as UTF-8' indicates UTF-8 without a BOM, while a plain 'UTF-8' means UTF-8 with a BOM. So maybe the process reading the CSV needs the byte order mark to correctly read the CSV as UTF-8.

But before going into that, make sure that your script actually writes UTF-8! When you open the new CSVs in Notepad++ (and it says 'ANSI as UTF-8'), are all 'special' characters displayed correctly? If not, you need to adapt your script to actually write UTF-8. If they are, check for the BOM difference.
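If the script turns out not to be writing UTF-8, one option is to normalise each field before it reaches `fputcsv`. The sketch below assumes the mbstring extension is available and that any non-UTF-8 input is ISO-8859-1; that source encoding is a guess and has to be replaced with whatever the data really uses. `to_utf8` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns $value unchanged if it is already valid UTF-8,
// otherwise converts it from an ASSUMED legacy encoding (ISO-8859-1 by default).
function to_utf8(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', $assumedSource);
}

// "M\xFCller" is "Müller" in ISO-8859-1; the second field is already UTF-8.
$row = ["M\xFCller", "Wi\u{0119}cej"];
$clean = array_map('to_utf8', $row);
```

Each cleaned row can then be passed to `fputcsv` as usual.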

answered Oct 11 '22 23:10

Henrik Opel


Tags:

php

character-encoding

notepad++

utf-8

Petruza
