 

Why does ï»¿ appear in my data?

I downloaded the file 'pi_million_digits.txt' from here:

https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt

I then used this code to open and read it:

filename = 'pi_million_digits.txt'

with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))

However, the output is correct except that it is preceded by some strange symbols: "ï»¿3.141...."

What causes these strange symbols? I am stripping the lines so I'd expect such symbols to be removed.

Bazman asked May 18 '17 16:05


1 Answer

It looks like you're opening a file that starts with a UTF-8 encoded byte order mark (BOM) using the ISO-8859-1 encoding (presumably because that, or a close relative such as Windows-1252, is the default encoding on your OS).

If you open it as bytes and read the first line, you should see something like this:

>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:

>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

… and opening it as UTF-8 shows the actual BOM character U+FEFF:

>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'

To strip the mark out, use the special encoding utf-8-sig:

>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
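
The same behaviour can be reproduced without the file at all, which may help when checking what a given codec does with the BOM. The following is a minimal sketch, assuming a short made-up byte string (`raw`) standing in for the first line of the file; it also shows why `strip()` does not help, since neither `ï»¿` nor U+FEFF counts as whitespace:

```python
import codecs

# Hypothetical stand-in for the first line of the file: BOM + a few digits.
raw = codecs.BOM_UTF8 + b"3.1415\n"      # b'\xef\xbb\xbf3.1415\n'

# Each of the three BOM bytes becomes its own character in ISO-8859-1.
as_latin1 = raw.decode("iso-8859-1")     # starts with 'ï»¿'

# Plain UTF-8 decodes the three bytes into the single character U+FEFF.
as_utf8 = raw.decode("utf-8")            # starts with '\ufeff'

# utf-8-sig decodes as UTF-8 and drops a leading BOM if one is present.
as_utf8_sig = raw.decode("utf-8-sig")    # starts with '3'

# strip() only removes whitespace, so the BOM survives it in both cases.
latin1_stripped = as_latin1.strip()
utf8_stripped = as_utf8.strip()

print(repr(as_latin1), repr(as_utf8), repr(as_utf8_sig))
```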

The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.

with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
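
Putting it together, the script from the question might end up looking something like the sketch below. The helper name `read_digits` and the small temporary file `pi_sample.txt` are made up for illustration (the sample stands in for 'pi_million_digits.txt' so the snippet runs on its own); only the `encoding='utf-8-sig'` argument is the actual fix:

```python
import os
import tempfile

def read_digits(filename):
    """Join the stripped lines of a file, ignoring a leading UTF-8 BOM."""
    # utf-8-sig also reads BOM-less UTF-8 files correctly.
    with open(filename, encoding='utf-8-sig') as file_object:
        return ''.join(line.strip() for line in file_object)

# Hypothetical sample file: a BOM followed by two short lines of digits.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'pi_sample.txt')
    with open(sample, 'wb') as f:
        f.write(b'\xef\xbb\xbf3.1415926535\n  8979323846\n')
    pi_string = read_digits(sample)

print(pi_string[:52] + "...")
print(len(pi_string))
```

With the real file, passing 'pi_million_digits.txt' to `read_digits` in place of the sample path should give the string without the leading symbols.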

Zero Piraeus answered Sep 30 '22 04:09