Why is 'Â' printed in front of '±' when code is run?

I'm trying to write a very simple output statement that goes into a csv file. It just states the deviation margin of the data, so I am using the '±' symbol, making it read something like '5 ft/s^2 ±2.4%'.

I'm working in Python 3. I have tried three different methods of using the '±' symbol: ASCII, Unicode, and just straight up copy-pasting the character into the editor. See below:

val1 = 3.2
val2 = 2.4

s1 = str(val1) + "ft/sec^2 " + chr(241) + str(val2) + "%"
s2 = str(val1) + "ft/sec^2 " + u'\u00B1' + str(val2) + "%"
s3 = str(val1) + "ft/sec^2 ±" + str(val2) + "%"

However, the output for all three of those methods is always the same for me...

3.2ft/sec^2 Â±2.4%

This 'Â' continues to show up. I am not at all experienced with encoding and things like that. I've searched and found a few situations that seemed to be related to mine but didn't understand enough to piece together a solution for my specific situation.

I'm using a pandas DataFrame to gather the data and then using the .to_csv() method to create the csv. The documentation for it states it defaults to 'utf-8' encoding.

Here are 7 lines that reproduce the same issue for me.

import pandas as pd 

df = pd.DataFrame(columns=['TestCol'])
df['TestCol'] = ["Test1: " + chr(241),
    "Test2: " + u'\u00B1',
    "Test3: " + "±"]
df.to_csv('TestExample.csv', index=False, encoding='utf-8')

In my CSV I get a column that looks like:

TestCol
Test1: Ã±
Test2: Â±
Test3: Â±

Any help, explanation, and knowledge is appreciated!

asked by the_pied_shadow, Dec 17 '22

2 Answers

Excel assumes Windows encoding when opening a .csv file. Which encoding depends on the language/country, but in English-speaking and Western European countries it is cp-1252, which is very similar to ISO-8859-1 (also known as "latin1").

This encoding uses a single byte per character. This means that it allows for at most 256 different characters (in fact fewer than 256, because some of the codes are reserved for control and non-printable characters).

Python 3 uses Unicode to represent strings. Unicode does not have the "only 256 symbols" limitation, because its code points go up to U+10FFFF (about 21 bits). In practice, Unicode can represent any character of any language in the world (and even some languages out of this world).
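
For instance, in plain Python (a minimal sketch, not part of the original question), you can see that each character of a string is just a Unicode code point, and that code points go far beyond 255:

print(ord("±"))   # 177, i.e. code point U+00B1
print(ord("€"))   # 8364 (U+20AC), far beyond the 256 slots available in a single-byte encoding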

The problem is that, when Unicode has to be written into a file (or transferred over a network), it has to be "encoded" as a sequence of bytes. One of the ways of doing this, and the current standard in many fields, is UTF-8.

UTF-8 encoding uses a variable number of bytes per character. It was designed to be compatible with ASCII, so any symbol in the ASCII table is represented with a single byte (which coincides with its ASCII code). But any character not in ASCII requires more than one byte. In particular, the character ± (code point U+00B1, decimal 177), when encoded in UTF-8, requires two bytes, with hexadecimal values c2 and b1.
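
You can check this directly in Python (a small sketch, only the standard library involved):

print("±".encode("utf-8"))        # b'\xc2\xb1'  -> the two bytes c2, b1
print(len("±".encode("utf-8")))   # 2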

When Excel reads these bytes, since it assumes cp-1252 encoding, which uses a single byte per character, it decodes the sequence c2, b1 as two separate characters. The first one is decoded as Â, and the second one, coincidentally, as ±.

Incidentally, ñ (code point U+00F1, or 241) is also encoded in UTF-8 as two bytes, with values c3, b1, which when decoded as cp-1252 are shown as Ã±. Note that the first one is now Ã instead of Â, but the second one is (again coincidentally) ±.
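
The same mojibake can be reproduced in Python by decoding those byte pairs as cp1252 (again just a sketch):

print(b'\xc2\xb1'.decode("cp1252"))   # Â±  -> what Excel shows for ±
print(b'\xc3\xb1'.decode("cp1252"))   # Ã±  -> what Excel shows for ñ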

The solution is to indicate to pandas that cp-1252 encoding should be used when writing the file:

df.to_csv("file.csv", encoding="cp1252")

Of course this has a potential problem. Since cp-1252 can represent at most 256 symbols, while Unicode can represent more than a million, it can happen that some string in your dataframe uses a character not representable in cp-1252. In that case you will get an encoding error.

Also, when reading back this .csv with Pandas, you have to specify the encoding, since Pandas assumes it is UTF-8.
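
For example (a minimal sketch, reusing the file name from the line above):

import pandas as pd

df = pd.read_csv("file.csv", encoding="cp1252")   # same encoding that was used when writing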

Update about utf-8-sig

Other answers and some comments refer to the "utf-8-sig" encoding, which is another valid (perhaps preferable) solution. I'll elaborate a bit on what it is.

UTF-8 is not the only way to convert Unicode into a sequence of bytes, although it is the one recommended by several standards. Another popular choice is (was?) UTF-16. In this encoding, Unicode characters are encoded as 16-bit values (some of them cannot fit in a single 16-bit value, but the set can be extended by using two 16-bit values for those characters).

The problem with using 16 bits per character, instead of 8, is that then endianness becomes relevant. Since a 16-bit value is not the basic unit in which memory, network and disk operate, when you write or send a 16-bit value to memory, network or disk, two bytes are in fact sent, and the order in which those bytes are sent is architecture-dependent. For example, assume that you need to write to disk the 16-bit number 66ff (expressed in hexadecimal). You have to "break" it into 66 and ff, and decide which one to write first. The sequence on disk can be 66, ff (this is called big-endian order), or ff, 66 (this is called little-endian order).
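
A quick way to see both orders from Python (a sketch, not part of the original answer):

value = 0x66FF
print(value.to_bytes(2, "big").hex())     # '66ff'  -> byte 66 written first
print(value.to_bytes(2, "little").hex())  # 'ff66'  -> byte ff written first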

If you are on a little-endian architecture, such as Intel, the default order of the bytes on disk will be different than on a big-endian architecture. Of course the problem appears when you try to read the file on a machine whose architecture is different from the one on which the file was created. You may end up assembling those bytes incorrectly as ff66, which would be a different Unicode character.

So there must be some way to include in the file information about the endianness used when it was created. This is the role of the so-called BOM (Byte Order Mark). It consists of the Unicode character U+FEFF. If this character is written as the first one in the file, then, when the file is read back, if your software finds FEFF as the first character, it knows that the endianness used to write the file is the same as the one being used to read it. But if it finds FFFE instead (the order is swapped), it knows that there is an endianness mismatch, and it then swaps each pair of bytes while reading to get the correct Unicode characters.

By the way, the Unicode standard never assigns a character to the code FFFE (it is a "noncharacter"), precisely to avoid confusion when reading the BOM. If you find FFFE at the beginning, it means that the endianness is wrong and you must swap bytes.
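
If you are curious, you can watch the BOM and the effect of byte order directly from Python (a small sketch):

print("\ufeff".encode("utf-16-be").hex())   # 'feff'  -> big-endian order
print("\ufeff".encode("utf-16-le").hex())   # 'fffe'  -> little-endian order: the bytes appear "swapped"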

None of this applies to UTF-8, since this encoding uses bytes (not 16-bit values) as its basic unit of information, and thus it is immune to the endianness problem. Nevertheless, you can encode FEFF in UTF-8 (it results in a sequence of 3 bytes, with values EF, BB and BF) and write it anyway as the first character in the file. This is what Python does when you specify the utf-8-sig encoding.
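
In Python terms, that is exactly the difference between the utf-8 and utf-8-sig codecs (a sketch):

print("\ufeff".encode("utf-8").hex())   # 'efbbbf'      -> the BOM as 3 UTF-8 bytes
print("±".encode("utf-8-sig").hex())    # 'efbbbfc2b1'  -> BOM prepended, then c2 b1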

In this case its purpose is not to help determine endianness, but to act as a kind of "fingerprint" which helps the software reading the file back to guess that the encoding used was UTF-8. If the software finds the "magic values" EF, BB and BF as the first 3 bytes in the file, it can conclude that the file is stored in UTF-8. These three bytes are discarded, and the rest is decoded as UTF-8.
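
You can apply the same check yourself; for example (a sketch, "file.csv" being the example name used earlier):

with open("file.csv", "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"
print(starts_with_bom)   # True for a file written with utf-8-sig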

In particular, Microsoft Windows uses this technique in most of its software. Apparently this also works in the case of Excel, so, to summarize:

  • You write your csv using df.to_csv("file.csv", encoding="utf-8-sig")
  • Excel reads the file and finds EF, BB, BF at the start. So it discards these bytes and assumes utf-8 for the rest of the file.
  • When the sequence c2, b1 later appears in the file, it is correctly decoded as UTF-8 to produce ±

This has the advantage of working on any Windows computer, no matter which code page it is using (cp1252 is used in Western Europe, other countries could be using other code pages, but Unicode and UTF-8 are universal).

The potential problem is when you try to read this csv on a non-Windows machine. The initial "magic bytes" EF, BB, BF may be meaningless to the software that reads the file, and then you can end up with "spurious" characters at the beginning, which may cause problems. If the software reading the file assumes UTF-8 encoding, these first three bytes will be decoded as the Unicode character U+FEFF, but they will not be discarded. This character is invisible and has zero width, so it is not possible to "see" it in an editor, but it will be there nevertheless. If the software reading the file assumes any other encoding, such as "latin1", these first three bytes will be incorrectly decoded as ï»¿, and they will be visible at the beginning of the file.

If you use Python to read this file back, you have to specify the utf-8-sig encoding again, to make Python discard those three initial bytes.
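
For example (a minimal sketch; both forms strip the BOM automatically):

import pandas as pd

df = pd.read_csv("file.csv", encoding="utf-8-sig")

# or with plain file I/O:
with open("file.csv", encoding="utf-8-sig") as f:
    text = f.read()   # no leading '\ufeff' in the result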

answered by JLDiaz, Dec 20 '22


You're writing UTF-8 to the file, but whatever you're using to view it is viewing it as latin-1 (or the similar Windows cp1252). You might try opening the file you're writing to with encoding='utf-8-sig' which puts a BOM at the beginning of the file so applications can recognize it as UTF-8. Or you might just tell your viewer program to interpret it as UTF-8. I strongly recommend not writing it as latin-1 or the like, as that will make the text non-portable to systems with other locales without explicitly telling people how to decode it.
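
Applied to the code from the question, this suggestion would look something like the following sketch (the data itself was already correct; only how it is detected and viewed changes):

import pandas as pd

df = pd.DataFrame({"TestCol": ["Test3: ±"]})

# Write with a BOM so Excel and similar viewers auto-detect UTF-8:
df.to_csv("TestExample.csv", index=False, encoding="utf-8-sig")

# Reading it back with the right codec shows the intended character:
with open("TestExample.csv", encoding="utf-8-sig") as f:
    print(f.read())   # prints the header followed by "Test3: ±"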

answered by ShadowRanger, Dec 20 '22